Computation and Language 74
☆ Searching for Privacy Risks in LLM Agents via Simulation
The widespread deployment of LLM-based agents is likely to introduce a
critical privacy threat: malicious agents that proactively engage others in
multi-turn interactions to extract sensitive information. These dynamic
dialogues enable adaptive attack strategies that can cause severe privacy
violations, yet their evolving nature makes it difficult to anticipate and
discover sophisticated vulnerabilities manually. To tackle this problem, we
present a search-based framework that alternates between improving attacker and
defender instructions by simulating privacy-critical agent interactions. Each
simulation involves three roles: data subject, data sender, and data recipient.
While the data subject's behavior is fixed, the attacker (data recipient)
attempts to extract sensitive information from the defender (data sender)
through persistent and interactive exchanges. To explore this interaction space
efficiently, our search algorithm employs LLMs as optimizers, using parallel
search with multiple threads and cross-thread propagation to analyze simulation
trajectories and iteratively propose new instructions. Through this process, we
find that attack strategies escalate from simple direct requests to
sophisticated multi-turn tactics such as impersonation and consent forgery,
while defenses advance from rule-based constraints to identity-verification
state machines. The discovered attacks and defenses transfer across diverse
scenarios and backbone models, demonstrating strong practical utility for
building privacy-aware agents.
comment: Preprint
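The alternating attacker/defender search described in the abstract can be sketched as a generic loop. The `simulate`, `propose_attack`, and `propose_defense` callables below are hypothetical stand-ins for the paper's agent simulation and LLM-as-optimizer components, not its actual API:

```python
def alternating_search(propose_attack, propose_defense, simulate,
                       attacker, defender, rounds):
    """Alternate between improving attacker and defender instructions.

    Each round simulates a privacy-critical interaction, then hands the
    resulting trajectory to an optimizer that rewrites one side's
    instructions. All callables are stand-ins for LLM calls.
    """
    for step in range(rounds):
        trajectory = simulate(attacker, defender)
        if step % 2 == 0:  # even rounds: refine the attack instructions
            attacker = propose_attack(attacker, trajectory)
        else:              # odd rounds: refine the defense instructions
            defender = propose_defense(defender, trajectory)
    return attacker, defender
```

In the paper's framework this loop would run in parallel threads with cross-thread propagation of promising instructions; the sketch shows only a single thread.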
☆ A Survey on Diffusion Language Models
Diffusion Language Models (DLMs) are rapidly emerging as a powerful and
promising alternative to the dominant autoregressive (AR) paradigm. By
generating tokens in parallel through an iterative denoising process, DLMs
possess inherent advantages in reducing inference latency and capturing
bidirectional context, thereby enabling fine-grained control over the
generation process. Recent advances have allowed DLMs to achieve performance
comparable to their autoregressive counterparts while delivering a several-fold
speed-up, making them a compelling choice for various natural language
processing tasks. In this survey, we provide a holistic
overview of the current DLM landscape. We trace its evolution and relationship
with other paradigms, such as autoregressive and masked language models, and
cover both foundational principles and state-of-the-art models. Our work offers
an up-to-date, comprehensive taxonomy and an in-depth analysis of current
techniques, from pre-training strategies to advanced post-training methods.
Another contribution of this survey is a thorough review of DLM inference
strategies and optimizations, including improvements in decoding parallelism,
caching mechanisms, and generation quality. We also highlight the latest
approaches to multimodal extensions of DLMs and delineate their applications
across various practical scenarios. Furthermore, our discussion addresses the
limitations and challenges of DLMs, including efficiency, long-sequence
handling, and infrastructure requirements, while outlining future research
directions to sustain progress in this rapidly evolving field. A project GitHub
repository is available at https://github.com/VILA-Lab/Awesome-DLMs.
☆ SSRL: Self-Search Reinforcement Learning
Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou
We investigate the potential of large language models (LLMs) to serve as
efficient simulators for agentic search tasks in reinforcement learning (RL),
thereby reducing dependence on costly interactions with external search
engines. To this end, we first quantify the intrinsic search capability of LLMs
via structured prompting and repeated sampling, which we term Self-Search. Our
results reveal that LLMs exhibit strong scaling behavior with respect to the
inference budget, achieving high pass@k on question-answering benchmarks,
including the challenging BrowseComp task. Building on these observations, we
introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability
through format-based and rule-based rewards. SSRL enables models to iteratively
refine their knowledge utilization internally, without requiring access to
external tools. Empirical evaluations demonstrate that SSRL-trained policy
models provide a cost-effective and stable environment for search-driven RL
training, reducing reliance on external search engines and facilitating robust
sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world
knowledge that can be effectively elicited to achieve high performance; 2) SSRL
demonstrates the potential of leveraging internal knowledge to reduce
hallucination; 3) SSRL-trained models integrate seamlessly with external search
engines without additional effort. Our findings highlight the potential of LLMs
to support more scalable RL agent training.
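Pass@k under repeated sampling, as used above to quantify Self-Search capability, is conventionally computed with the unbiased combinatorial estimator over n samples of which c are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    drawn without replacement from n (of which c are correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over questions gives the benchmark-level pass@k whose scaling with the inference budget the abstract reports.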
☆ From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms
Recent advancements in machine learning have spurred growing interests in
automated interpreting quality assessment. Nevertheless, existing research
suffers from insufficient examination of language use quality, unsatisfactory
modeling effectiveness due to data scarcity and imbalance, and a lack of
efforts to explain model predictions. To address these gaps, we propose a
multi-dimensional modeling framework that integrates feature engineering, data
augmentation, and explainable machine learning. This approach prioritizes
explainability over "black box" predictions by utilizing only
construct-relevant, transparent features and conducting SHapley Additive exPlanations (SHAP)
analysis. Our results demonstrate strong predictive performance on a novel
English-Chinese consecutive interpreting dataset, identifying BLEURT and
CometKiwi scores as the strongest predictive features for fidelity,
pause-related features for fluency, and Chinese-specific phraseological
diversity metrics for language use. Overall, by placing particular emphasis on
explainability, we present a scalable, reliable, and transparent alternative to
traditional human evaluation, facilitating the provision of detailed diagnostic
feedback for learners and supporting self-regulated learning, advantages not
afforded by automated scores in isolation.
☆ Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning
Amidst a shortage of qualified mental health professionals, the integration
of large language models (LLMs) into psychological applications offers a
promising way to alleviate the growing burden of mental health disorders.
Recent reasoning-augmented LLMs have achieved remarkable performance in
mathematics and programming, while research in the psychological domain has
predominantly emphasized emotional support and empathetic dialogue, with
limited attention to reasoning mechanisms that are beneficial to generating
reliable responses. Therefore, in this paper, we propose Psyche-R1, the first
Chinese psychological LLM that jointly integrates empathy, psychological
expertise, and reasoning, built upon a novel data curation pipeline.
Specifically, we design a comprehensive data synthesis pipeline that produces
over 75k high-quality psychological questions paired with detailed rationales,
generated through chain-of-thought (CoT) reasoning and iterative
prompt-rationale optimization, along with 73k empathetic dialogues.
Subsequently, we employ a hybrid training strategy wherein challenging samples
are identified through a multi-LLM cross-selection strategy for group relative
policy optimization (GRPO) to improve reasoning ability, while the remaining
data is used for supervised fine-tuning (SFT) to enhance empathetic response
generation and psychological domain knowledge. Extensive experimental results
demonstrate the effectiveness of Psyche-R1 across several psychological
benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B
DeepSeek-R1.
☆ Reinforced Language Models for Sequential Decision Making
Large Language Models (LLMs) show potential as sequential decision-making
agents, but their application is often limited due to a reliance on large,
computationally expensive models. This creates a need to improve smaller
models, yet existing post-training methods are designed for single-turn
interactions and cannot handle credit assignment in multi-step agentic tasks.
To address this, we introduce Multi-Step Group-Relative Policy Optimization
(MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal
Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP)
frameworks. For credit assignment, MS-GRPO attributes the entire cumulative
episode reward to each individual episode step. We supplement this algorithm
with a novel absolute-advantage-weighted episode sampling strategy that we show
improves training performance. We evaluate our approach by post-training a
3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate
that the method is effective in improving decision-making performance: our
post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on
the Frozen Lake task. This work demonstrates that targeted post-training is a
practical and efficient alternative to relying on model scale for creating
sequential decision-making agents using LLMs.
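The credit-assignment rule described above, broadcasting each episode's cumulative reward to every one of its steps and then normalizing group-relatively, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's implementation:

```python
import statistics

def ms_grpo_step_advantages(episode_returns, episode_lengths):
    """Assign each episode's full cumulative return to every one of its
    steps, then compute group-relative advantages across the sampled
    group of episodes (GRPO-style mean/std normalization)."""
    mean = statistics.mean(episode_returns)
    std = statistics.pstdev(episode_returns) or 1.0  # guard zero variance
    return [[(ret - mean) / std] * length
            for ret, length in zip(episode_returns, episode_lengths)]
```

The paper additionally weights episode sampling by absolute advantage, which this sketch omits.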
☆ Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Technical Solutions
Memory is fundamental to intelligence, enabling learning, reasoning, and
adaptability across biological and artificial systems. While Transformer
architectures excel at sequence modeling, they face critical limitations in
long-range context retention, continual learning, and knowledge integration.
This review presents a unified framework bridging neuroscience principles,
including dynamic multi-timescale memory, selective attention, and
consolidation, with engineering advances in Memory-Augmented Transformers. We
organize recent progress through three taxonomic dimensions: functional
objectives (context extension, reasoning, knowledge integration, adaptation),
memory representations (parameter-encoded, state-based, explicit, hybrid), and
integration mechanisms (attention fusion, gated control, associative
retrieval). Our analysis of core memory operations (reading, writing,
forgetting, and capacity management) reveals a shift from static caches toward
adaptive, test-time learning systems. We identify persistent challenges in
scalability and interference, alongside emerging solutions including
hierarchical buffering and surprise-gated updates. This synthesis provides a
roadmap toward cognitively-inspired, lifelong-learning Transformer
architectures.
☆ Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback
Novelty assessment is a central yet understudied aspect of peer review,
particularly in high-volume fields like NLP where reviewer capacity is
increasingly strained. We present a structured approach for automated novelty
evaluation that models expert reviewer behavior through three stages: content
extraction from submissions, retrieval and synthesis of related work, and
structured comparison for evidence-based assessment. Our method is informed by
a large-scale analysis of human-written novelty reviews and captures key
patterns such as independent claim verification and contextual reasoning.
Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty
assessments, the approach achieves 86.5% alignment with human reasoning and
75.3% agreement on novelty conclusions, substantially outperforming existing
LLM-based baselines. The method produces detailed, literature-aware analyses
and improves consistency over ad hoc reviewer judgments. These results
highlight the potential for structured LLM-assisted approaches to support more
rigorous and transparent peer review without displacing human expertise. Data
and code are made available.
☆ Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models
Reinforcement learning with verifiable rewards (RLVR), which typically adopts
Pass@1 as the reward, faces difficulty in balancing exploration and
exploitation, causing policies to prefer conservative actions and converge to a
local optimum. Identifying an appropriate reward metric is therefore crucial.
Although Pass@k has been used in evaluation in prior work, its connection to
LLM exploration ability in RLVR remains largely overlooked. To investigate
this, we first use Pass@k as the reward to train the policy model (i.e.,
$\textbf{Pass@k Training}$), and observe an improvement in its exploration
ability. Next, we derive an analytical solution for the advantage of Pass@k
Training, leading to an efficient and effective training process. Building on
this, our analysis reveals that exploration and exploitation are not inherently
conflicting objectives and can instead mutually enhance each other. Moreover,
Pass@k Training with the analytical derivation essentially amounts to directly
designing the advantage function. Inspired by this, we preliminarily explore
advantage design for RLVR, showing promising results and highlighting a
potential future direction.
comment: Technical Report about RLVR: 32 pages, 18 figures, 7 tables
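As a concrete (deliberately naive) instantiation of a Pass@k reward, before the paper's analytical derivation: sample n rollouts, partition them into groups of k, and give every member of a group the group's best reward, so a group "passes" if any member is correct:

```python
def passk_group_rewards(rewards, k):
    """Naive Pass@k training reward: each rollout inherits the max reward
    of its size-k group. Illustrative only; the paper replaces this
    sampling-based scheme with an analytical advantage."""
    assert len(rewards) % k == 0, "n must be divisible by k"
    shaped = []
    for i in range(0, len(rewards), k):
        group_best = max(rewards[i:i + k])
        shaped.extend([group_best] * k)
    return shaped
```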
★ Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs
Although large language models (LLMs) have achieved remarkable success, their
prefix-only prompting paradigm and sequential generation process offer limited
flexibility for exploiting bidirectional information. Diffusion large language models
(dLLMs) present new opportunities through their bidirectional attention
mechanisms and iterative refinement processes, enabling more flexible in-place
prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting
with Early Exit), a novel framework that transforms prefix-only prompting into
in-place prompting specifically designed for dLLMs. ICE integrates in-place
prompts directly within masked token positions during iterative refinement and
employs a confidence-aware early exit mechanism to significantly reduce
computational overhead. Extensive experiments demonstrate ICE's effectiveness,
achieving up to 17.29% accuracy improvement with 4.12$\times$ speedup on GSM8K,
and up to 276.67$\times$ acceleration on MMLU while maintaining competitive
performance.
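The confidence-aware early exit can be sketched as stopping the iterative refinement once every position's confidence clears a threshold. Here `step` and `confidences` are hypothetical stand-ins for one dLLM denoising step and its per-token confidence scores, not the paper's interface:

```python
def refine_with_early_exit(step, confidences, state, tau=0.9, max_steps=64):
    """Run iterative refinement, exiting early once the least-confident
    position reaches tau (hypothetical interface, illustrative threshold)."""
    for _ in range(max_steps):
        state = step(state)
        if min(confidences(state)) >= tau:
            break  # all positions confident: skip remaining steps
    return state
```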
☆ Learning from Natural Language Feedback for Personalized Question Answering
Personalization is crucial for enhancing both the effectiveness and user
satisfaction of language technologies, particularly in information-seeking
tasks like question answering. Current approaches for personalizing large
language models (LLMs) often rely on retrieval-augmented generation (RAG),
followed by reinforcement learning with scalar reward signals to teach models
how to use retrieved personal context. We believe that these scalar rewards
sometimes provide weak, non-instructive feedback, limiting learning efficiency
and personalization quality. We introduce VAC, a novel framework for
personalized response generation that replaces scalar rewards with natural
language feedback (NLF) that is generated conditioned on the user profiles and
the question narratives. NLF serves as a rich and actionable supervision
signal, allowing the policy model to iteratively refine its outputs and
internalize effective personalization strategies. Training alternates between
optimizing the feedback model and fine-tuning the policy model on the improved
responses, resulting in a policy model that no longer requires feedback at
inference. Evaluation on the LaMP-QA benchmark that consists of three diverse
domains demonstrates consistent and significant improvements over the
state-of-the-art results. Human evaluations further confirm the superior
quality of the generated responses. These results demonstrate that NLF provides
more effective signals for optimizing personalized question answering.
☆ Continuous Bangla Sign Language Translation: Mitigating the Expense of Gloss Annotation with the Assistance of Graph
Millions of individuals worldwide are affected by deafness and hearing
impairment. Sign language serves as a sophisticated means of communication for
the deaf and hard of hearing. However, in societies that prioritize spoken
languages, sign language often faces underestimation, leading to communication
barriers and social exclusion. The Continuous Bangla Sign Language Translation
project aims to address this gap by enhancing translation methods. While recent
approaches leverage transformer architecture for state-of-the-art results, our
method integrates graph-based methods with the transformer architecture. This
fusion, combining transformer and STGCN-LSTM architectures, proves more
effective in gloss-free translation. Our contributions include architectural
fusion, exploring various fusion strategies, and achieving a new
state-of-the-art performance on diverse sign language datasets, namely
RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0. Our approach
demonstrates superior performance compared to current translation outcomes
across all datasets, with notable BLEU-4 improvements of 4.01, 2.07, and 0.5
over GASLT, GASLT, and slt_how2sign on RWTH-PHOENIX-2014T, CSL-Daily, and
How2Sign, respectively. Also, we introduce
benchmarking on the BornilDB v1.0 dataset for the first time. Our method sets a
benchmark for future research, emphasizing the importance of gloss-free
translation to improve communication accessibility for the deaf and hard of
hearing.
☆ Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages
This paper presents the first systematic study of strategies for translating
Coptic into French. Our comprehensive pipeline systematically evaluates: pivot
versus direct translation, the impact of pre-training, the benefits of
multi-version fine-tuning, and model robustness to noise. Utilizing aligned
biblical corpora, we demonstrate that fine-tuning with a stylistically-varied
and noise-aware training corpus significantly enhances translation quality. Our
findings provide crucial practical insights for developing translation tools
for historical languages in general.
☆ eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM
Irma Heithoff, Marc Guggenberger, Sandra Kalogiannis, Susanne Mayer, Fabian Maag, Sigurd Schacht, Carsten Lanquillon
This paper presents a feasibility study on the deployment of a European Deep
Inference Fabric (eDIF), an NDIF-compatible infrastructure designed to support
mechanistic interpretability research on large language models. The need for
widespread accessibility of LLM interpretability infrastructure in Europe
drives this initiative to democratize advanced model analysis capabilities for
the research community. The project introduces a GPU-based cluster hosted at
Ansbach University of Applied Sciences and interconnected with partner
institutions, enabling remote model inspection via the NNsight API. A
structured pilot study involving 16 researchers from across Europe evaluated
the platform's technical performance, usability, and scientific utility. Users
conducted interventions such as activation patching, causal tracing, and
representation analysis on models including GPT-2 and DeepSeek-R1-70B. The
study revealed a gradual increase in user engagement, stable platform
performance throughout, and a positive reception of the remote experimentation
capabilities. It also marked the starting point for building a user community
around the platform. Identified limitations such as prolonged download
durations for activation data as well as intermittent execution interruptions
are addressed in the roadmap for future development. This initiative marks a
significant step towards widespread accessibility of LLM interpretability
infrastructure in Europe and lays the groundwork for broader deployment,
expanded tooling, and sustained community collaboration in mechanistic
interpretability research.
comment: 9 pages
☆ When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have demonstrated remarkable
capabilities across a diverse range of multimodal tasks. However, these models
suffer from a core problem known as text dominance: they depend heavily on text
for their inference, while underutilizing other modalities. While prior work
has acknowledged this phenomenon in vision-language tasks, it is often
attributed to data biases or model architectures. In this paper, we conduct the first
systematic investigation of text dominance across diverse data modalities,
including images, videos, audio, time-series, and graphs. To measure this
imbalance, we propose two evaluation metrics: the Modality Dominance Index
(MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis
reveals that text dominance is both significant and pervasive across all tested
modalities. Our in-depth analysis identifies three underlying causes: attention
dilution from severe token redundancy in non-textual modalities, the influence
of fusion architecture design, and task formulations that implicitly favor
textual inputs. Furthermore, we propose a simple token compression method that
effectively rebalances model attention. Applying this method to LLaVA-7B, for
instance, drastically reduces its MDI from 10.23 to a well-balanced value of
0.86. Our analysis and methodological framework offer a foundation for the
development of more equitable and comprehensive multimodal language models.
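One plausible reading of the Modality Dominance Index, consistent with the reported values (10.23 dominant, 0.86 roughly balanced) but not necessarily the paper's exact definition, is the ratio of mean per-token attention mass on text tokens to that on non-text tokens:

```python
def modality_dominance_index(text_attn_mass, other_attn_mass,
                             n_text_tokens, n_other_tokens):
    """Hypothetical MDI: mean per-token attention received by text tokens
    divided by that received by non-text tokens (1.0 = balanced)."""
    return (text_attn_mass / n_text_tokens) / (other_attn_mass / n_other_tokens)
```

Under this reading, token compression raises the non-text denominator's per-token share, pulling the index toward 1.0.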
☆ Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards
Reward sparsity in long-horizon reinforcement learning (RL) tasks remains a
significant challenge, while existing outcome-based reward shaping struggles to
define meaningful immediate rewards without introducing bias or requiring
explicit task decomposition. Alternatively, verification-based reward shaping
uses stepwise critics, but misalignment between immediate rewards and long-term
objectives can lead to reward hacking and suboptimal policies. In this work, we
address this problem in the context of software engineering (SWE) tasks, where
multi-turn reasoning and rule-based verification are critical. We introduce the
SWE-oriented RL Framework, a unified system supporting multi-turn interaction,
docker-based execution, and customizable reward functions. Additionally, we
propose Gated Reward Accumulation (G-RA), a novel method that accumulates
immediate rewards only when high-level (long-term) rewards meet a predefined
threshold, ensuring stable RL optimization. Experiments on SWE-bench Verified
and kBench demonstrate that G-RA leads to an increase in completion rates
(47.6% → 93.8% and 22.0% → 86.0%) and modification
rates (19.6% → 23.8% and 12.0% → 42.0%), while avoiding
policy degradation caused by reward misalignment. Our findings highlight the
importance of balanced reward accumulation in long-horizon RL and provide a
practical solution.
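Gated Reward Accumulation, as described, can be sketched as follows: stepwise immediate rewards count toward the return only when the episode's high-level (long-term) reward clears a threshold. The names and the additive form are illustrative assumptions:

```python
def gated_return(immediate_rewards, long_term_reward, threshold):
    """G-RA sketch: accumulate immediate rewards only when the long-term
    reward meets the gate; otherwise use the long-term reward alone,
    so stepwise critics cannot be reward-hacked on failed episodes."""
    if long_term_reward >= threshold:
        return long_term_reward + sum(immediate_rewards)
    return long_term_reward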
☆ Improving Value-based Process Verifier via Low-Cost Variance Reduction
Large language models (LLMs) have achieved remarkable success in a wide range
of tasks. However, their reasoning capabilities, particularly in complex
domains like mathematics, remain a significant challenge. Value-based process
verifiers, which estimate the probability of a partial reasoning chain leading
to a correct solution, are a promising approach for improving reasoning.
Nevertheless, their effectiveness is often hindered by estimation error in
their training annotations, a consequence of the limited number of Monte Carlo
(MC) samples feasible due to the high cost of LLM inference. In this paper, we
identify that the estimation error primarily arises from high variance rather
than bias, and the MC estimator is a Minimum Variance Unbiased Estimator
(MVUE). To address the problem, we propose the Compound Monte
Carlo Sampling (ComMCS) method, which constructs an unbiased
estimator by linearly combining the MC estimators from the current and
subsequent steps. Theoretically, we show that our method leads to a predictable
reduction in variance, while maintaining an unbiased estimation without
additional LLM inference cost. We also perform empirical experiments on the
MATH-500 and GSM8K benchmarks to demonstrate the effectiveness of our method.
Notably, ComMCS outperforms a regression-based optimization method by 2.8
points and the non-variance-reduced baseline by 2.2 points on MATH-500 in a
Best-of-32 sampling experiment.
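The compound estimator, a linear combination of MC estimates from the current and subsequent steps, remains unbiased for a common target whenever each component estimate is unbiased and the weights sum to one. The generic sketch below assumes such weights; the paper derives a specific variance-minimizing scheme:

```python
def compound_mc_estimate(step_estimates, weights):
    """Linearly combine unbiased MC value estimates; unbiasedness is
    preserved when the weights sum to 1 (generic sketch of ComMCS)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * v for w, v in zip(weights, step_estimates))
```

Because the component estimates reuse rollouts already drawn for annotation, the combination reduces variance without additional LLM inference, matching the abstract's claim.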
☆ Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment
The alignment of language models (LMs) with human preferences is critical for
building reliable AI systems. The problem is typically framed as optimizing an
LM policy to maximize the expected reward that reflects human preferences.
Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment
method that directly optimizes the policy from static preference data, and
further improved by incorporating on-policy sampling (i.e., preference
candidates generated during the training loop) for better LM alignment.
However, we show that on-policy data is not always optimal, with systematic
effectiveness differences emerging between static and on-policy preference
candidates. For example, on-policy data can be 3$\times$ as effective as static
data for Llama-3, but only 0.4$\times$ as effective for Zephyr. To explain the
phenomenon, we propose the alignment stage assumption,
which divides the alignment process into two distinct stages: the preference
injection stage, which benefits from diverse data, and the preference
fine-tuning stage, which favors high-quality data. Through theoretical and
empirical analysis, we characterize these stages and propose an effective
algorithm to identify the boundaries between them. We perform experiments on 5
models (Llama, Zephyr, Phi-2, Qwen, Pythia) and 2 alignment methods (DPO,
SLiC-HF) to show the generalizability of alignment stage assumption and
boundary measurement.
☆ Reverse Physician-AI Relationship: Full-process Clinical Diagnosis Driven by a Large Language Model
Full-process clinical diagnosis in the real world encompasses the entire
diagnostic workflow that begins with only an ambiguous chief complaint. While
artificial intelligence (AI), particularly large language models (LLMs), is
transforming clinical diagnosis, its role remains largely as an assistant to
physicians. Under this AI-assisted pattern, AI can only answer specific
medical questions at certain points within the diagnostic process, but lacks
the ability to drive the entire diagnostic process starting from an ambiguous
complaint, which still relies heavily on human physicians. This gap limits AI's
ability to fully reduce physicians' workload and enhance diagnostic efficiency.
To address this, we propose a paradigm shift that reverses the relationship
between physicians and AI: repositioning AI as the primary director, with
physicians serving as its assistants. We therefore present DxDirector-7B, an LLM
endowed with advanced deep thinking capabilities, enabling it to drive the
full-process diagnosis with minimal physician involvement. Furthermore,
DxDirector-7B establishes a robust accountability framework for misdiagnoses,
delineating responsibility between AI and human physicians. In evaluations
across rare, complex, and real-world cases under full-process diagnosis
setting, DxDirector-7B not only achieves significantly superior diagnostic
accuracy but also reduces physician workload substantially more than
state-of-the-art medical LLMs as well as general-purpose LLMs. Fine-grained
analyses across multiple clinical departments and tasks validate its efficacy,
with expert evaluations indicating its potential to serve as a viable
substitute for medical specialists. These findings mark a new era where AI,
traditionally a physicians' assistant, now drives the entire diagnostic process
to drastically reduce physicians' workload, indicating an efficient and
accurate diagnostic solution.
comment: 39 pages
☆ When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing AAAI
In the study of trustworthy Natural Language Processing (NLP), a number of
important research fields have emerged, including that of
\textit{explainability} and \textit{privacy}. While research interest in both
explainable and privacy-preserving NLP has increased considerably in recent
years, there remains a lack of investigation at the intersection of the two.
This leaves a considerable gap in understanding of whether achieving
\textit{both} explainability and privacy is possible, or whether the two are at
odds with each other. In this work, we conduct an empirical investigation into
the privacy-explainability trade-off in the context of NLP, guided by the
popular overarching methods of \textit{Differential Privacy} (DP) and Post-hoc
Explainability. Our findings include a view into the intricate relationship
between privacy and explainability, which is formed by a number of factors,
including the nature of the downstream task and choice of the text
privatization and explainability method. In doing so, we highlight the potential
for privacy and explainability to co-exist, and we summarize our findings in a
collection of practical recommendations for future work at this important
intersection.
comment: Accepted to AAAI/ACM Conference on AI, Ethics, and Society (AIES
2025)
☆ DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales
Generating textual rationales from large vision-language models (LVLMs) to
support trainable multimodal misinformation detectors has emerged as a
promising paradigm. However, its effectiveness is fundamentally limited by
three core challenges: (i) insufficient diversity in generated rationales, (ii)
factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting
content that introduces noise. We introduce DiFaR, a detector-agnostic
framework that produces diverse, factual, and relevant rationales to enhance
misinformation detection. DiFaR employs five chain-of-thought prompts to elicit
varied reasoning traces from LVLMs and incorporates a lightweight post-hoc
filtering module to select rationale sentences based on sentence-level
factuality and relevance scores. Extensive experiments on four popular
benchmarks demonstrate that DiFaR outperforms four baseline categories by up to
5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics
and human evaluations confirm that DiFaR significantly improves rationale
quality across all three dimensions.
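The lightweight post-hoc filtering module can be sketched as a sentence-level threshold filter. The score inputs and thresholds below are illustrative assumptions, not the paper's components:

```python
def filter_rationale(sentences, factuality, relevance,
                     tau_fact=0.5, tau_rel=0.5):
    """DiFaR-style post-hoc filter sketch: keep only rationale sentences
    whose sentence-level factuality and relevance scores both clear
    their (illustrative) thresholds."""
    return [s for s, f, r in zip(sentences, factuality, relevance)
            if f >= tau_fact and r >= tau_rel]
```

In the full framework, the surviving sentences from five chain-of-thought prompts would then be passed to the downstream misinformation detector.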
☆ Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints
Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh
Large language models (LLMs) are limited by substantial computational cost.
We introduce a "computational economics" framework that treats an LLM as an
internal economy of resource-constrained agents (attention heads and neuron
blocks) that must allocate scarce computation to maximize task utility. First,
we show empirically that when computation is scarce, standard LLMs reallocate
attention toward high-value tokens while preserving accuracy. Building on this
observation, we propose an incentive-driven training paradigm that augments the
task loss with a differentiable computation cost term, encouraging sparse and
efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method
yields a family of models that trace a Pareto frontier and consistently
dominate post-hoc pruning; for similar accuracy we obtain roughly a forty
percent reduction in FLOPs and lower latency, together with more interpretable
attention patterns. These results indicate that economic principles offer a
principled route to designing efficient, adaptive, and more transparent LLMs
under strict resource constraints.
comment: Preprint; 7 figures, 4 tables, 1 algorithm. Experiments on GLUE
(MNLI, STS-B, CoLA) and WikiText-103 with BERT-base; evaluation includes
  FLOPs, latency, and Gini and entropy metrics
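The incentive-driven objective the abstract describes, a task loss augmented with a differentiable computation-cost term, can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the gate values, the weight `lam`, and the function name are illustrative assumptions.

```python
import numpy as np

def augmented_loss(task_loss, gate_activations, lam=0.01):
    """Task loss plus a differentiable computation-cost term.
    gate_activations: per-head soft gates in [0, 1]; their sum proxies FLOPs spent."""
    compute_cost = float(np.sum(gate_activations))
    return task_loss + lam * compute_cost

# A sparse allocation can win overall despite a slightly higher task loss,
# tracing out the accuracy/compute Pareto trade-off the abstract reports.
gates_dense = np.array([0.9, 0.8, 0.95, 0.85])
gates_sparse = np.array([0.9, 0.05, 0.02, 0.85])
dense = augmented_loss(0.30, gates_dense)
sparse = augmented_loss(0.31, gates_sparse)
```

Training against such an objective rewards the model for deactivating low-value heads, which is what pushes activations toward sparsity.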
☆ Evaluating LLMs on Chinese Idiom Translation
Idioms, whose figurative meanings usually differ from their literal
interpretations, are common in everyday language, especially in Chinese, where
they often contain historical references and follow specific structural
patterns. Despite recent progress in machine translation with large language
models, little is known about Chinese idiom translation. In this work, we
introduce IdiomEval, a framework with a comprehensive error taxonomy for
Chinese idiom translation. We annotate 900 translation pairs from nine modern
systems, including GPT-4o and Google Translate, across four domains: web, news,
Wikipedia, and social media. We find these systems fail at idiom translation,
producing incorrect, literal, partial, or even missing translations. The
best-performing system, GPT-4, makes errors in 28% of cases. We also find that
existing evaluation metrics measure idiom quality poorly, with Pearson
correlation below 0.48 against human ratings. We thus develop improved models that
achieve F$_1$ scores of 0.68 for detecting idiom translation errors.
comment: Accepted at COLM 2025
☆ ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
Narrative comprehension on long stories and novels has been a challenging
domain owing to their intricate plotlines and entangled, often evolving
relations among characters and entities. Given LLMs' diminished reasoning over
extended context and the high computational cost, retrieval-based approaches
remain pivotal in practice. However, traditional RAG methods can fall
short due to their stateless, single-step retrieval process, which often
overlooks the dynamic nature of capturing interconnected relations within
long-range context. In this work, we propose ComoRAG, holding the principle
that narrative reasoning is not a one-shot process, but a dynamic, evolving
interplay between new evidence acquisition and past knowledge consolidation,
analogous to human cognition when reasoning with memory-related signals in the
brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes
iterative reasoning cycles while interacting with a dynamic memory workspace.
In each cycle, it generates probing queries to devise new exploratory paths,
then integrates the retrieved evidence of new aspects into a global memory
pool, thereby supporting the emergence of a coherent context for the query
resolution. Across four challenging long-context narrative benchmarks (200K+
tokens), ComoRAG consistently outperforms strong RAG baselines, with relative
gains of up to 11% over the strongest one. Further analysis reveals
that ComoRAG is particularly advantageous for complex queries requiring global
comprehension, offering a principled, cognitively motivated paradigm for
retrieval-based long context comprehension towards stateful reasoning. Our code
is publicly released at https://github.com/EternityJune25/ComoRAG
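The cycle the abstract describes, probe for new queries, fold retrieved evidence into a global memory pool, and retry until the impasse resolves, can be sketched as a loop. All function names and the toy corpus below are illustrative stand-ins for the retriever and LLM calls, not ComoRAG's actual interface.

```python
def comorag_loop(question, probe, retrieve, answer, max_cycles=3):
    """Sketch of a ComoRAG-style reasoning cycle: generate probing queries,
    consolidate retrieved evidence into a memory pool, then retry the answer."""
    memory_pool = []
    for _ in range(max_cycles):
        for q in probe(question, memory_pool):   # devise new exploratory paths
            memory_pool.extend(retrieve(q))      # integrate evidence of new aspects
        result = answer(question, memory_pool)
        if result is not None:                   # impasse resolved
            return result
    return None

# Toy stand-ins (pure assumptions) to exercise the loop:
CORPUS = {"heir": "Ada is the heir", "home": "Ada lives in Graz"}

def probe(question, pool):
    return [k for k, fact in CORPUS.items() if fact not in pool]

def retrieve(query):
    return [CORPUS[query]]

def answer(question, pool):
    return " ".join(pool) if len(pool) >= 2 else None

result = comorag_loop("Who is the heir and where do they live?",
                      probe, retrieve, answer)
```

The key design point is statefulness: evidence from earlier cycles stays in the pool, so later probes build on past knowledge rather than restarting retrieval from scratch.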
☆ CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model
Existing vision-and-language navigation models often deviate from the correct
trajectory when executing instructions. However, these models lack effective
error correction capability, hindering their recovery from errors. To address
this challenge, we propose Self-correction Flywheel, a novel post-training
paradigm. Instead of considering the model's error trajectories on the training
set as a drawback, our paradigm emphasizes their significance as a valuable
data source. We have developed a method to identify deviations in these error
trajectories and devised innovative techniques to automatically generate
self-correction data for perception and action. These self-correction data
serve as fuel to power the model's continued training. The brilliance of our
paradigm is revealed when we re-evaluate the model on the training set,
uncovering new error trajectories. At this time, the self-correction flywheel
begins to spin. Through multiple flywheel iterations, we progressively enhance
our monocular RGB-based VLA navigation model CorrectNav. Experiments on R2R-CE
and RxR-CE benchmarks show CorrectNav achieves new state-of-the-art success
rates of 65.1% and 69.3%, surpassing prior best VLA navigation models by 8.2%
and 16.4%. Real robot tests in various indoor and outdoor environments
demonstrate CorrectNav's superior capability of error correction, dynamic
obstacle avoidance, and long instruction following.
☆ Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
With the rapid proliferation of Natural Language Processing (NLP), especially
Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs
remains a key challenge for understanding model vulnerabilities and improving
robustness. In this context, we propose a new black-box attack method that
leverages the interpretability of large models. We introduce the Sparse Feature
Perturbation Framework (SFPF), a novel approach for adversarial text generation
that utilizes sparse autoencoders to identify and manipulate critical features
in text. After using the SAE model to reconstruct hidden layer representations,
we perform feature clustering on the successfully attacked texts to identify
features with higher activations. These highly activated features are then
perturbed to generate new adversarial texts. This selective perturbation
preserves the malicious intent while amplifying safety signals, thereby
increasing their potential to evade existing defenses. Our method enables a new
red-teaming strategy that balances adversarial effectiveness with safety
alignment. Experimental results demonstrate that adversarial texts generated by
SFPF can bypass state-of-the-art defense mechanisms, revealing persistent
vulnerabilities in current NLP systems. However, the method's effectiveness
varies across prompts and layers, and its generalizability to other
architectures and larger models remains to be validated.
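The core SFPF move, encode a hidden state with a sparse autoencoder, amplify the features most associated with successful attacks, and decode back, can be sketched as below. The random weights, the scale factor, and the use of top activations as a stand-in for the paper's feature clustering are all assumptions for illustration.

```python
import numpy as np

# Toy sparse autoencoder (weights are random stand-ins, not a trained SAE)
rng = np.random.default_rng(0)
d_model, d_feat = 8, 32
W_enc = rng.normal(size=(d_model, d_feat)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_feat, d_model)) / np.sqrt(d_feat)

def sae_encode(h):
    return np.maximum(h @ W_enc, 0.0)   # ReLU activations are sparse

def sfpf_perturb(h, scale=1.5, k=2):
    """Amplify the k most active SAE features (standing in for the
    attack-correlated features SFPF identifies by clustering), then decode."""
    f = sae_encode(h)
    top = np.argsort(f)[-k:]
    f[top] *= scale
    return f @ W_dec                    # perturbed hidden-state reconstruction

h = rng.normal(size=d_model)
h_adv = sfpf_perturb(h)
baseline = sae_encode(h) @ W_dec        # unperturbed reconstruction
```

Because only a few interpretable features are touched, the perturbation stays selective: the rest of the representation, and hence the text's intent, is preserved.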
★ Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Evaluating jailbreak attacks is challenging when prompts are not overtly
harmful or fail to induce harmful outputs. Unfortunately, many existing
red-teaming datasets contain such unsuitable prompts. To evaluate attacks
accurately, these datasets need to be assessed and cleaned for maliciousness.
However, existing malicious content detection methods rely on either manual
annotation, which is labor-intensive, or large language models (LLMs), which
show inconsistent accuracy across harmful content types. To balance accuracy and
efficiency, we propose a hybrid evaluation framework named MDH (Malicious
content Detection based on LLMs with Human assistance) that combines LLM-based
annotation with minimal human oversight, and apply it to dataset cleaning and
detection of jailbroken responses. Furthermore, we find that well-crafted
developer messages can significantly boost jailbreak success, leading us to
propose two new strategies: D-Attack, which leverages context simulation, and
DH-CoT, which incorporates hijacked chains of thought. The code, datasets,
judgments, and detection results will be released in the GitHub repository:
https://github.com/AlienZhang1996/DH-CoT.
☆ Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding
While aspect-based sentiment analysis (ABSA) has made substantial progress,
challenges remain for low-resource languages, which are often overlooked in
favour of English. Current cross-lingual ABSA approaches focus on limited, less
complex tasks and often rely on external translation tools. This paper
introduces a novel approach using constrained decoding with
sequence-to-sequence models, eliminating the need for unreliable translation
tools and improving cross-lingual performance by 5% on average for the most
complex task. The proposed method also supports multi-tasking, which enables
solving multiple ABSA tasks with a single model, with constrained decoding
boosting results by more than 10%.
We evaluate our approach across seven languages and six ABSA tasks,
surpassing state-of-the-art methods and setting new benchmarks for previously
unexplored tasks. Additionally, we assess large language models (LLMs) in
zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in
zero-shot and few-shot settings, fine-tuning achieves competitive results
compared to smaller multilingual models, albeit at the cost of longer training
and inference times.
We provide practical recommendations for real-world applications, enhancing
the understanding of cross-lingual ABSA methodologies. This study offers
valuable insights into the strengths and limitations of cross-lingual ABSA
approaches, advancing the state-of-the-art in this challenging research domain.
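Constrained decoding of the kind described above can be sketched as masking the vocabulary at each step so the model can only emit well-formed (aspect, sentiment) tuples. The grammar, the greedy search, and the stand-in "logits" below are illustrative assumptions, not the paper's sequence-to-sequence setup.

```python
SENTIMENTS = ["positive", "negative", "neutral"]

def allowed_next(prefix, aspects):
    """Grammar for one (aspect, sentiment) tuple: the vocabulary mask per step."""
    if len(prefix) == 0:
        return aspects          # must open with an aspect term from the input
    if len(prefix) == 1:
        return SENTIMENTS       # then exactly one sentiment label
    return []                   # tuple complete -> stop

def constrained_greedy(score, aspects):
    out = []
    while True:
        options = allowed_next(out, aspects)
        if not options:
            return tuple(out)
        out.append(max(options, key=score))  # greedy pick within the mask

# Stand-in for model scores (assumption: higher = preferred by the decoder)
logits = {"battery": 0.9, "screen": 0.4,
          "positive": 0.7, "negative": 0.2, "neutral": 0.1}
pred = constrained_greedy(lambda tok: logits[tok], aspects=["battery", "screen"])
# pred == ("battery", "positive")
```

Because malformed outputs are unreachable by construction, the same trained model becomes more reliable in cross-lingual settings where free-form generation tends to drift.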
☆ Large Language Models for Summarizing Czech Historical Documents and Beyond
Text summarization is the task of shortening a larger body of text into a
concise version while retaining its essential meaning and key information.
While summarization has been significantly explored in English and other
high-resource languages, Czech text summarization, particularly for historical
documents, remains underexplored due to linguistic complexities and a scarcity
of annotated datasets. Large language models such as Mistral and mT5 have
demonstrated excellent results on many natural language processing tasks and
languages. Therefore, we employ these models for Czech summarization, resulting
in two key contributions: (1) achieving new state-of-the-art results on the
modern Czech summarization dataset SumeCzech using these advanced models, and
(2) introducing a novel dataset called Posel od Čerchova for summarization
of historical Czech documents with baseline results. Together, these
contributions provide a great potential for advancing Czech text summarization
and open new avenues for research in Czech historical text processing.
comment: Published in Proceedings of the 17th International Conference on
Agents and Artificial Intelligence - Volume 2 (ICAART 2025). Official
version: https://www.scitepress.org/Link.aspx?doi=10.5220/0013374100003890
☆ Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models
Aspect-based sentiment analysis (ABSA) has made significant strides, yet
challenges remain for low-resource languages due to the predominant focus on
English. Current cross-lingual ABSA studies often centre on simpler tasks and
rely heavily on external translation tools. In this paper, we present a novel
sequence-to-sequence method for compound ABSA tasks that eliminates the need
for such tools. Our approach, which uses constrained decoding, improves
cross-lingual ABSA performance by up to 10%. This method broadens the scope of
cross-lingual ABSA, enabling it to handle more complex tasks and providing a
practical, efficient alternative to translation-dependent techniques.
Furthermore, we compare our approach with large language models (LLMs) and show
that while fine-tuned multilingual LLMs can achieve comparable results,
English-centric LLMs struggle with these tasks.
comment: Published in Proceedings of the 17th International Conference on
Agents and Artificial Intelligence - Volume 2 (ICAART 2025). Official
version: https://www.scitepress.org/Link.aspx?doi=10.5220/0013349400003890
☆ Improving OCR for Historical Texts of Multiple Languages
This paper presents our methodology and findings from three tasks across
Optical Character Recognition (OCR) and Document Layout Analysis using advanced
deep learning techniques. First, for the historical Hebrew fragments of the
Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation
and employed the Kraken and TrOCR models to improve character recognition.
Second, for the task of analyzing 16th- to 18th-century meeting resolutions, we utilized a
Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for
semantic segmentation with a Bidirectional LSTM, incorporating confidence-based
pseudolabeling to refine our model. Finally, for the modern English handwriting
recognition task, we applied a CRNN with a ResNet34 encoder, trained using the
Connectionist Temporal Classification (CTC) loss function to effectively
capture sequential dependencies. This report offers valuable insights and
suggests potential directions for future research.
☆ Making Qwen3 Think in Korean with Reinforcement Learning
We present a two-stage fine-tuning approach to make the large language model
Qwen3 14B "think" natively in Korean. In the first stage, supervised
fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a
strong foundation in Korean logical reasoning, yielding notable improvements in
Korean-language tasks and even some gains in general reasoning ability. In the
second stage, we employ reinforcement learning with a customized Group Relative
Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning
alignment and overall problem-solving performance. We address critical
stability challenges in GRPO training - such as reward hacking and policy
collapse - by introducing an oracle judge model that calibrates the reward
signal. Our approach achieves stable learning (avoiding the collapse observed
in naive GRPO) and leads to steady, incremental performance gains. The final
RL-tuned model demonstrates substantially improved results on advanced
reasoning benchmarks (particularly math and coding tasks) while maintaining
knowledge and language proficiency, successfully conducting its internal
chain-of-thought entirely in Korean.
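The group-relative step that GRPO builds on, normalizing each sampled completion's reward against the group sampled for the same prompt, can be sketched as below. This shows only the standard advantage computation; the abstract's customizations (the oracle judge calibrating the reward) are not modeled.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each completion's reward is normalized
    against the mean and std of its sampling group for the same prompt."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Four sampled completions for one prompt, scored by a reward model
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Since advantages are relative within a group, a reward model that drifts in absolute scale still yields usable signal, though, as the abstract notes, reward hacking within a group still requires a separate safeguard such as a judge model.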
☆ Cross-Prompt Encoder for Low-Performing Languages
Soft prompts have emerged as a powerful alternative to adapters in
parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs)
to adapt to downstream tasks without architectural changes or parameter
updates. While prior work has focused on stabilizing training via parameter
interaction in small neural prompt encoders, their broader potential for
transfer across languages remains unexplored. In this paper, we demonstrate
that a prompt encoder can play a central role in improving performance on
low-performing languages: those that achieve poor accuracy even under full-model
fine-tuning. We introduce the Cross-Prompt Encoder (XPE), which combines a
lightweight encoding architecture with multi-source training on typologically
diverse languages - a design that enables the model to capture abstract and
transferable patterns across languages. To complement XPE, we propose a Dual
Soft Prompt mechanism that combines an encoder-based prompt with a directly
trained standard soft prompt. This hybrid design proves especially effective
for target languages that benefit from both broadly shared structure and
language-specific alignment. Experiments on the SIB-200 benchmark reveal a
consistent trade-off: XPE is most effective for low-performing languages, while
hybrid variants offer broader adaptability across multilingual settings.
☆ Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation
Recommender systems in concert with Large Language Models (LLMs) present
promising avenues for generating semantically-informed recommendations.
However, LLM-based recommenders exhibit a tendency to overemphasize semantic
correlations within users' interaction history. When taking pretrained
collaborative ID embeddings as input, LLM-based recommenders progressively
weaken the inherent collaborative signals as the embeddings propagate through
LLM backbones layer by layer, as opposed to traditional Transformer-based
sequential models in which collaborative signals are typically preserved or
even enhanced for state-of-the-art performance. To address this limitation, we
introduce FreLLM4Rec, an approach designed to balance semantic and
collaborative information from a spectral perspective. Item embeddings that
incorporate both semantic and collaborative information are first purified
using a Global Graph Low-Pass Filter (G-LPF) to preliminarily remove irrelevant
high-frequency noise. Temporal Frequency Modulation (TFM) then actively
preserves collaborative signal layer by layer. Note that the collaborative
preservation capability of TFM is theoretically guaranteed by establishing a
connection between the optimal but hard-to-implement local graph Fourier
filters and the suboptimal yet computationally efficient frequency-domain
filters. Extensive experiments on four benchmark datasets demonstrate that
FreLLM4Rec successfully mitigates collaborative signal attenuation and achieves
competitive performance, with improvements of up to 8.00% in NDCG@10 over the
best baseline. Our findings provide insights into how LLMs process
collaborative information and offer a principled approach for improving
LLM-based recommendation systems.
comment: 12 pages, 8 figures
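A global graph low-pass filter of the kind the G-LPF stage describes can be sketched as one smoothing step over an item-item graph: subtracting a scaled Laplacian term attenuates high-frequency (noisy) components of the embeddings. The graph, the single-step form, and `alpha` are illustrative assumptions, not FreLLM4Rec's exact filter.

```python
import numpy as np

def low_pass_filter(adj, X, alpha=0.5):
    """One step of a graph low-pass filter: smooth embeddings X over the
    item graph, damping components aligned with high Laplacian frequencies."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    return X - alpha * (L @ X)

# Three fully connected items with maximally "rough" 1-D embeddings
adj = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
X = np.array([[1.0], [0.0], [-1.0]])
X_smooth = low_pass_filter(adj, X)
```

After filtering, neighboring items' embeddings move closer together, which is exactly the collaborative (low-frequency) signal the paper argues LLM backbones otherwise attenuate.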
☆ From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis ECAI-2025
Documents are core carriers of information and knowledge, with broad
applications in finance, healthcare, and scientific research. Tables, as the
main medium for structured data, encapsulate key information and are among the
most critical document components. Existing studies largely focus on
surface-level tasks such as layout analysis, table detection, and data
extraction, lacking deep semantic parsing of tables and their contextual
associations. This limits advanced tasks like cross-paragraph data
interpretation and context-consistent analysis. To address this, we propose
DOTABLER, a table-centric semantic document parsing framework designed to
uncover deep semantic links between tables and their context. DOTABLER
leverages a custom dataset and domain-specific fine-tuning of pre-trained
models, integrating a complete parsing pipeline to identify context segments
semantically tied to tables. Built on this semantic understanding, DOTABLER
implements two core functionalities: table-centric document structure parsing
and domain-specific table retrieval, delivering comprehensive table-anchored
semantic analysis and precise extraction of semantically relevant tables.
Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs,
DOTABLER achieves over 90% Precision and F1 scores, demonstrating superior
performance in table-context semantic analysis and deep document parsing
compared to advanced models such as GPT-4o.
comment: 8 pages, 5 figures, 28th European Conference on Artificial
Intelligence (ECAI-2025)
☆ ReviewRL: Towards Automated Scientific Review with RL
Sihang Zeng, Kai Tian, Kaiyan Zhang, Yuru Wang, Junqi Gao, Runze Liu, Sa Yang, Jingxuan Li, Xinwei Long, Jiaheng Ma, Biqing Qi, Bowen Zhou
Peer review is essential for scientific progress but faces growing challenges
due to increasing submission volumes and reviewer fatigue. Existing automated
review approaches struggle with factual accuracy, rating consistency, and
analytical depth, often generating superficial or generic feedback lacking the
insights characteristic of high-quality human reviews. We introduce ReviewRL, a
reinforcement learning framework for generating comprehensive and factually
grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP
retrieval-augmented context generation pipeline that incorporates relevant
scientific literature, (2) supervised fine-tuning that establishes foundational
reviewing capabilities, and (3) a reinforcement learning procedure with a
composite reward function that jointly enhances review quality and rating
accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL
significantly outperforms existing methods across both rule-based metrics and
model-based quality assessments. ReviewRL establishes a foundational framework
for RL-driven automatic critique generation in scientific discovery,
demonstrating promising potential for future development in this domain. The
implementation of ReviewRL will be released on GitHub.
comment: 13 pages, 5 figures
☆ Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race
Gustavo Bonil, Simone Hashiguti, Jhessica Silva, João Gondim, Helena Maia, Nádia Silva, Helio Pedrini, Sandra Avila
With the advance of Artificial Intelligence (AI), Large Language Models
(LLMs) have gained prominence and been applied in diverse contexts. As they
evolve into more sophisticated versions, it is essential to assess whether they
reproduce biases, such as discrimination and racialization, while maintaining
hegemonic discourses. Current bias detection approaches rely mostly on
quantitative, automated methods, which often overlook the nuanced ways in which
biases emerge in natural language. This study proposes a qualitative,
discursive framework to complement such methods. Through manual analysis of
LLM-generated short stories featuring Black and white women, we investigate
gender and racial biases. We contend that qualitative methods such as the one
proposed here are fundamental to help both developers and users identify the
precise ways in which biases manifest in LLM outputs, thus enabling better
conditions to mitigate them. Results show that Black women are portrayed as
tied to ancestry and resistance, while white women appear in self-discovery
processes. These patterns reflect how language models replicate crystalized
discursive representations, reinforcing essentialization and a sense of social
immobility. When prompted to correct biases, models offered superficial
revisions that maintained problematic meanings, revealing limitations in
fostering inclusive narratives. Our results demonstrate the ideological
functioning of algorithms and have significant implications for the ethical use
and development of AI. The study reinforces the need for critical,
interdisciplinary approaches to AI design and deployment, addressing how
LLM-generated discourses reflect and perpetuate inequalities.
comment: 29 pages, 3 figures
☆ Inductive Bias Extraction and Matching for LLM Prompts
The active research topic of prompt engineering makes it evident that LLMs
are sensitive to small changes in prompt wording. A portion of this can be
ascribed to the inductive bias that is present in the LLM. By using an LLM's
output as a portion of its prompt, we can more easily create satisfactory
wording for prompts. This has the effect of creating a prompt that matches the
inductive bias in the model. Empirically, we show that using this Inductive Bias
Extraction and Matching strategy improves LLM Likert ratings used for
classification by up to 19% and LLM Likert ratings used for ranking by up to
27%.
☆ A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona
This study explores language change and variation in Toki Pona, a constructed
language with approximately 120 core words. Taking a computational and
corpus-based approach, the study examines features including fluid word classes
and transitivity in order to examine (1) changes in preferences of content
words for different syntactic positions over time and (2) variation in usage
across different corpora. The results suggest that sociolinguistic factors
influence Toki Pona in the same way as natural languages, and that even
constructed linguistic systems naturally evolve as communities use them.
comment: 14 pages, 14 figures. Submitted to UGA Working Papers in Linguistics
2025
♻ ☆ CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
Large Language Models (LLMs) have significantly advanced the state-of-the-art
in various coding tasks. Beyond directly answering user queries, LLMs can also
serve as judges, assessing and comparing the quality of responses generated by
other models. Such an evaluation capability is crucial both for benchmarking
different LLMs and for improving response quality through response ranking.
However, despite the growing adoption of the LLM-as-a-Judge paradigm, its
effectiveness in coding scenarios remains underexplored due to the absence of
dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a
benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge
models across three critical coding tasks: code generation, code repair, and
unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge
models, we find that recent thinking models significantly outperform
non-thinking models on our carefully designed code judging tasks. Notably, even
relatively small thinking models, such as Qwen3-8B, can outperform specially
trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still
exhibit significant randomness in their judgment of coding tasks. For pairwise
judging tasks, simply changing the order in which responses are presented can
substantially impact accuracy. In addition, when judging code and unit tests
written by different LLMs, LLM-as-a-Judge models also show variance in
performance. This sensitivity raises concerns about the reliability and
consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal
prompting strategies for LLM-as-a-Judge. We find that using pair-wise
comparison outperforms scalar point-wise judging. Furthermore, retaining
comments and reasoning in the full, unprocessed LLM response leads to improved
judge performance.
comment: Dataset is available at
https://huggingface.co/datasets/mattymchen/codejudgebench
♻ ☆ BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them
Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein
Understanding biases and stereotypes encoded in the weights of Large Language
Models (LLMs) is crucial for developing effective mitigation strategies. Biased
behaviour is often subtle and non-trivial to isolate, even when deliberately
elicited, making systematic analysis and debiasing particularly challenging. To
address this, we introduce BiasGym, a simple, cost-effective, and generalizable
framework for reliably injecting, analyzing, and mitigating conceptual
associations within LLMs. BiasGym consists of two components: BiasInject, which
injects specific biases into the model via token-based fine-tuning while
keeping the model frozen, and BiasScope, which leverages these injected signals
to identify and steer the components responsible for biased behavior. Our
method enables consistent bias elicitation for mechanistic analysis, supports
targeted debiasing without degrading performance on downstream tasks, and
generalizes to biases unseen during token-based fine-tuning. We demonstrate the
effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from
Italy being 'reckless drivers') and in probing fictional associations (e.g.,
people from a fictional country having 'blue skin'), showing its utility for
both safety interventions and interpretability research.
comment: Under review
♻ ☆ iFairy: the First 2-bit Complex LLM with All Parameters in $\{\pm1, \pm i\}$
Feiyu Wang, Guoan Wang, Yihao Zhang, Shengfan Wang, Weitao Li, Bokai Huang, Shimao Chen, Zihan Jiang, Rui Xu, Tong Yang
Quantization-Aware Training (QAT) integrates quantization into the training
loop, enabling LLMs to learn robust low-bit representations, and is widely
recognized as one of the most promising research directions. All current QAT
research focuses on minimizing quantization error relative to full-precision
models, whose accuracy acts as an upper bound (an accuracy ceiling). No
existing method has even attempted to surpass this ceiling. To break this
ceiling, we propose a new paradigm: raising the ceiling (full-precision model),
and then still quantizing it efficiently into 2 bits. We propose Fairy$\pm i$,
the first 2-bit quantization framework for complex-valued LLMs. Specifically,
our method leverages the representational advantages of the complex domain to
boost full-precision accuracy. We map weights to the fourth roots of unity
$\{\pm1, \pm i\}$, forming a perfectly symmetric and information-theoretically
optimal 2-bit representation. Importantly, each quantized weight has either a
zero real or imaginary part, enabling multiplication-free inference using only
additions and element swaps. Experimental results show that Fairy$\pm i$
outperforms the ceiling of existing 2-bit quantization approaches in terms of
both PPL and downstream tasks, while maintaining strict storage and compute
efficiency. This work opens a new direction for building highly accurate and
practical LLMs under extremely low-bit constraints.
comment: 15 pages, 9 figures
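The codebook described above, the fourth roots of unity {±1, ±i}, and the multiplication-free property can be sketched directly: nearest-codeword quantization, plus the observation that multiplying by i is just a real/imaginary swap with one sign flip. This is an illustrative sketch, not the paper's QAT procedure.

```python
import numpy as np

CODEBOOK = np.array([1, -1, 1j, -1j])      # fourth roots of unity

def quantize(w):
    """Map each complex weight to its nearest codeword in {±1, ±i}."""
    idx = np.argmin(np.abs(w[..., None] - CODEBOOK), axis=-1)
    return CODEBOOK[idx]

def times_i(x):
    """i * (a + bi) = -b + ai: a real/imag swap and a sign flip, no multiply."""
    return complex(-x.imag, x.real)

q = quantize(np.array([0.9 + 0.1j, -0.2 - 0.8j]))
```

Since every codeword has a zero real or imaginary part, an inner product against quantized weights reduces to additions, swaps, and sign flips, which is the source of the claimed multiplication-free inference.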
♻ ☆ FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Large language models (LLMs) have been widely deployed with rapidly expanding
context windows to support increasingly demanding applications. However, long
contexts pose significant deployment challenges, primarily due to the KV cache
whose size grows proportionally with context length. While KV cache compression
methods are proposed to address this issue, KV dropping methods incur
considerable accuracy loss, and KV retrieval methods suffer from significant
efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization
framework to enhance KV retrieval efficiency while preserving accuracy. On the
algorithm side, FreeKV introduces speculative retrieval to shift the KV
selection and recall processes out of the critical path, combined with
fine-grained correction to ensure accuracy. On the system side, FreeKV employs
hybrid KV layouts across CPU and GPU memory to eliminate fragmented data
transfers, and leverages double-buffered streamed recall to further improve
efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy
across various scenarios and models, delivering up to 13$\times$ speedup
compared to SOTA KV retrieval methods.
♻ ☆ BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
The rise of long-context Large Language Models (LLMs) amplifies memory and
bandwidth demands during autoregressive decoding, as the Key-Value (KV) cache
grows with each generated token. Low-bit KV-cache quantization (e.g., 4-bit or
2-bit) can reduce memory footprint while preserving accuracy, but existing
systems suffer from slow decoding due to their exclusive reliance on CUDA
cores, neglecting Tensor Cores (the primary source of compute on modern GPUs).
We present BitDecoding, a new long-context LLM inference system with a low-bit
KV cache. BitDecoding enables efficient low-bit KV-cache decoding by
cooperatively leveraging CUDA cores and Tensor Cores. It introduces methods for
automatically inducing optimized layouts to exploit Tensor Cores, along with
warp-level parallelization strategies for dequantization. For unified system
support, BitDecoding includes a query transformation module supporting diverse
attention variants, a high-performance quantization kernel that supports both
the tensor-wise and channel-wise scaling used in various quantization
algorithms, and a dequantization kernel with a software-defined pipeline to
coordinate CUDA core and Tensor Core execution for mixed-precision operations.
Evaluated on RTX 4090, A100, and H100, BitDecoding accelerates decoding by up
to 7.5x, 4.8x, and 8.9x, respectively, over FP16 FlashDecoding-v2, and
surpasses the state-of-the-art low-bit system QServe by up to 4.3x. On
LLaMA-3.1-8B with a 128K context, BitDecoding reduces single-batch decoding
latency by 3x, showing substantial improvements for long-context generation.
The code is available at https://github.com/DD-DuDa/BitDecoding.
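As a minimal sketch of the low-bit KV-cache idea, here is symmetric 4-bit quantization with one scale per channel; the exact scheme BitDecoding implements (and its Tensor Core layouts) is more involved:

```python
def quantize_4bit(channel):
    """Symmetric 4-bit quantization: map floats to ints in [-8, 7] with one scale."""
    scale = max(abs(v) for v in channel) / 7 or 1.0  # avoid a zero scale
    q = [max(-8, min(7, round(v / scale))) for v in channel]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

vals = [0.1, -0.4, 0.35, 0.05]
q, s = quantize_4bit(vals)
recon = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(vals, recon))
print(q)        # [2, -7, 6, 1]
print(err < s)  # True: reconstruction error stays below one quantization step
```

Per-channel (rather than per-tensor) scaling is what keeps outlier channels from blowing up the quantization error of the rest.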
♻ ☆ AF-MAT: Aspect-aware Flip-and-Fuse xLSTM for Aspect-based Sentiment Analysis
Aspect-based Sentiment Analysis (ABSA) is a crucial NLP task that extracts
fine-grained opinions and sentiments from text, such as product reviews and
customer feedback. Existing methods often trade off efficiency for performance:
traditional LSTM or RNN models struggle to capture long-range dependencies,
transformer-based methods are computationally costly, and Mamba-based
approaches rely on CUDA and weaken local dependency modeling. The recently
proposed Extended Long Short-Term Memory (xLSTM) model offers a promising
alternative by effectively capturing long-range dependencies through
exponential gating and enhanced memory variants, sLSTM for modeling local
dependencies, and mLSTM for scalable, parallelizable memory. However, xLSTM's
application in ABSA remains unexplored. To address this, we introduce
Aspect-aware Flip-and-Fuse xLSTM (AF-MAT), a framework that leverages xLSTM's
strengths. AF-MAT features an Aspect-aware matrix LSTM (AA-mLSTM) mechanism
that introduces a dedicated aspect gate, enabling the model to selectively
emphasize tokens semantically relevant to the target aspect during memory
updates. To model multi-scale context, we incorporate a FlipMix block that
sequentially applies a partially flipped Conv1D (pf-Conv1D) to capture
short-range dependencies in reverse order, followed by a fully flipped mLSTM
(ff-mLSTM) to model long-range dependencies via full sequence reversal.
Additionally, we propose MC2F, a lightweight Multihead Cross-Feature Fusion
based on mLSTM gating, which dynamically fuses AA-mLSTM outputs (queries and
keys) with FlipMix outputs (values) for adaptive representation integration.
Experiments on three benchmark datasets demonstrate that AF-MAT outperforms
state-of-the-art baselines, achieving higher accuracy in ABSA tasks.
comment: 9 pages, 4 figures
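The aspect-gate idea can be conveyed with a much simpler stand-in (our construction, not AA-mLSTM's actual recurrence): weight each token's contribution to a pooled representation by its similarity to the target-aspect embedding.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def aspect_pool(tokens, aspect):
    """Pool token vectors, gated by similarity to the aspect vector."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    gates = [sigmoid(dot(t, aspect)) for t in tokens]
    d = len(aspect)
    pooled = [sum(g * t[j] for g, t in zip(gates, tokens)) / sum(gates)
              for j in range(d)]
    return pooled, gates

tokens = [[2.0, 0.0],   # aspect-relevant token
          [0.0, 2.0]]   # irrelevant token
pooled, gates = aspect_pool(tokens, aspect=[2.0, 0.0])
print(gates[0] > gates[1])  # True: the relevant token is emphasized
```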
♻ ☆ Sample-efficient LLM Optimization with Reset Replay
Recent advancements in post-training Large Language Models (LLMs),
particularly through Reinforcement Learning (RL) and preference optimization
methods, are key drivers for enhancing their reasoning capabilities. However,
these methods are often plagued by low sample efficiency and a susceptibility
to primacy bias, where overfitting to initial experiences degrades policy
quality and damages the learning process. To address these challenges, we
introduce LLM optimization with Reset Replay (LoRR), a general and powerful
plugin designed to enhance sample efficiency in any preference-based
optimization framework. LoRR's core mechanism enables training at a high replay
number, maximizing the utility of each collected data batch. To counteract the
risk of overfitting inherent in high-replay training, LoRR incorporates a
periodic reset strategy with reusing initial data, which preserves network
plasticity. Furthermore, it leverages a hybrid optimization objective,
combining supervised fine-tuning (SFT) and preference-based losses to further
bolster data exploitation. Our extensive experiments demonstrate that LoRR
significantly boosts the performance of various preference optimization methods
on both mathematical and general reasoning benchmarks. Notably, an iterative
DPO approach augmented with LoRR achieves comparable performance on challenging
math tasks, outperforming some complex and computationally intensive RL-based
algorithms. These findings highlight that LoRR offers a practical,
sample-efficient, and highly effective paradigm for LLM finetuning, unlocking
greater performance from limited data.
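The hybrid objective can be sketched as an SFT negative log-likelihood combined with a DPO-style preference loss; the weighting and the exact DPO form below are illustrative assumptions, not the paper's formulation:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def hybrid_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                beta=0.1, sft_weight=0.5):
    # DPO-style term: push the policy's preference margin above the reference's.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    dpo = -math.log(sigmoid(margin))
    sft = -logp_chosen  # SFT negative log-likelihood on the chosen response
    return dpo + sft_weight * sft

loss = hybrid_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                   ref_chosen=-6.0, ref_rejected=-8.0)
print(round(loss, 3))  # 3.098
```

In LoRR this loss would be applied repeatedly over each batch (the high replay number), with periodic parameter resets guarding against the overfitting that replay invites.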
♻ ☆ Knowledge-based Consistency Testing of Large Language Models EMNLP 2024
In this work, we systematically expose and measure the inconsistency and
knowledge gaps of Large Language Models (LLMs). Specifically, we propose an
automated testing framework (called KonTest) which leverages a knowledge graph
to construct test cases. KonTest probes and measures the inconsistencies in the
LLM's knowledge of the world via a combination of semantically-equivalent
queries and test oracles (metamorphic or ontological oracle). KonTest further
mitigates knowledge gaps via a weighted LLM model ensemble. Using four
state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that
KonTest generates 19.2% error-inducing inputs (1917 errors from 9979 test
inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A
mitigation method informed by KonTest's test suite reduces the LLM knowledge
gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for
knowledge-based consistency testing because it is only 60%-68% effective in
knowledge construction.
comment: 12 pages, 4 figures, 8 tables, Accepted at EMNLP 2024 Findings
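The metamorphic-oracle idea, that semantically equivalent queries must elicit identical answers, can be sketched as follows (`ask` is a stand-in for an LLM call; KonTest's actual oracles and knowledge-graph-driven test generation are richer):

```python
def consistency_errors(ask, paraphrase_sets):
    """Flag paraphrase sets whose answers disagree (metamorphic oracle)."""
    errors = []
    for queries in paraphrase_sets:
        answers = {ask(q) for q in queries}
        if len(answers) > 1:  # equivalent queries must yield one answer
            errors.append((queries, answers))
    return errors

# A deliberately inconsistent stub standing in for an LLM.
stub = {"Capital of France?": "Paris",
        "France's capital city?": "Paris",
        "Who wrote Hamlet?": "Shakespeare",
        "Hamlet's author?": "Marlowe"}.get

errs = consistency_errors(stub, [
    ["Capital of France?", "France's capital city?"],
    ["Who wrote Hamlet?", "Hamlet's author?"],
])
print(len(errs))  # 1: only the Hamlet paraphrases disagree
```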
♻ ☆ Evaluation of Cultural Competence of Vision-Language Models
Srishti Yadav, Lauren Tilton, Maria Antoniak, Taylor Arnold, Jiaang Li, Siddhesh Milind Pawar, Antonia Karamolegkou, Stella Frank, Zhaochong An, Negar Rostamzadeh, Daniel Hershcovich, Serge Belongie, Ekaterina Shutova
Modern vision-language models (VLMs) often fail at cultural competency
evaluations and benchmarks. Given the diversity of applications built upon
VLMs, there is renewed interest in understanding how they encode cultural
nuances. While individual aspects of this problem have been studied, we still
lack a comprehensive framework for systematically identifying and annotating
the nuanced cultural dimensions present in images for VLMs. This position paper
argues that foundational methodologies from visual culture studies (cultural
studies, semiotics, and visual studies) are necessary for cultural analysis of
images. Building upon this review, we propose a set of five frameworks,
corresponding to cultural dimensions, that must be considered for a more
complete analysis of the cultural competencies of VLMs.
♻ ☆ Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition
This work presents a suite of fine-tuned Whisper models for Swedish, trained
on a dataset of unprecedented size and variability for this mid-resourced
language. As smaller languages are often underrepresented in
multilingual training datasets, substantial improvements in performance can be
achieved by fine-tuning existing multilingual models, as shown in this work.
This work reports an overall improvement across model sizes compared to
OpenAI's Whisper evaluated on Swedish. Most notably, we report an average 47%
reduction in WER comparing our best performing model to OpenAI's
whisper-large-v3, in evaluations across FLEURS, Common Voice, and NST.
comment: Accepted at Interspeech 2025
♻ ☆ Improved GUI Grounding via Iterative Narrowing
Graphical User Interface (GUI) grounding plays a crucial role in enhancing
the capabilities of Vision-Language Model (VLM) agents. While general VLMs,
such as GPT-4V, demonstrate strong performance across various tasks, their
proficiency in GUI grounding remains suboptimal. Recent studies have focused on
fine-tuning these models specifically for zero-shot GUI grounding, yielding
significant improvements over baseline performance. We introduce a visual
prompting framework that employs an iterative narrowing mechanism to further
improve the performance of both general and fine-tuned models in GUI grounding.
For evaluation, we tested our method on a comprehensive benchmark comprising
various UI platforms and provided the code to reproduce our results.
comment: Code available at
https://github.com/ant-8/GUI-Grounding-via-Iterative-Narrowing
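A toy version of the iterative-narrowing loop, on our reading of the mechanism (repeatedly crop around the current point estimate so each later pass predicts within a smaller region; the stub "model" and crop schedule are invented):

```python
def narrow(predict, width, height, steps=3, shrink=0.5):
    """Iteratively crop around the model's point estimate."""
    x0, y0, x1, y1 = 0.0, 0.0, float(width), float(height)
    for _ in range(steps):
        px, py = predict(x0, y0, x1, y1)      # prediction inside current crop
        w, h = (x1 - x0) * shrink, (y1 - y0) * shrink
        x0, y0 = max(0.0, px - w / 2), max(0.0, py - h / 2)
        x1, y1 = min(float(width), x0 + w), min(float(height), y0 + h)
    return (x0 + x1) / 2, (y0 + y1) / 2

# Stub "model": always points at the true target, clamped into the crop.
target = (640.0, 360.0)
pred = lambda x0, y0, x1, y1: (min(max(target[0], x0), x1),
                               min(max(target[1], y0), y1))
point = narrow(pred, 1280, 720)
print(point)  # (640.0, 360.0)
```

Each iteration effectively gives the model a higher-resolution view of the same target, which is why narrowing helps even without fine-tuning.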
♻ ☆ Curse of High Dimensionality Issue in Transformer for Long-context Modeling ICML 2025
Transformer-based large language models (LLMs) excel in natural language
processing tasks by capturing long-range dependencies through self-attention
mechanisms. However, long-context modeling faces significant computational
inefficiencies due to \textit{redundant} attention computations: while
attention weights are often \textit{sparse}, all tokens consume \textit{equal}
computational resources. In this paper, we reformulate traditional
probabilistic sequence modeling as a \textit{supervised learning task},
enabling the separation of relevant and irrelevant tokens and providing a
clearer understanding of redundancy. Based on this reformulation, we
theoretically analyze attention sparsity, revealing that only a few tokens
significantly contribute to predictions. Building on this, we formulate
attention optimization as a linear coding problem and propose a \textit{group
coding strategy}, theoretically showing its ability to improve robustness
against random noise and enhance learning efficiency. Motivated by this, we
propose \textit{Dynamic Group Attention} (DGA), which leverages the group
coding to explicitly reduce redundancy by aggregating less important tokens
during attention computation. Empirical results show that our DGA significantly
reduces computational costs while maintaining competitive performance. Code is
available at https://github.com/bolixinyu/DynamicGroupAttention.
comment: Accepted at ICML 2025
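One way to picture the redundancy-reduction idea: keep the top-scoring keys individually and collapse the remaining low-importance keys into a single averaged token. This is a simplification of the paper's group coding strategy, with made-up scores:

```python
def grouped_keys(scores, keys, k=2):
    """Keep the k highest-scoring keys; average the rest into one token."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    top, rest = order[:k], order[k:]
    kept = [keys[i] for i in top]
    if rest:  # aggregate the less important keys into a single mean vector
        d = len(keys[0])
        kept.append([sum(keys[i][j] for i in rest) / len(rest) for j in range(d)])
    return kept

keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.9, 0.1, 0.8, 0.2]
print(grouped_keys(scores, keys))  # [[1.0, 0.0], [2.0, 2.0], [2.0, 0.5]]
```

Attention over the reduced key set then costs O(k) per query instead of O(n), while the aggregated token preserves a coarse signal from the pruned tokens.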
♻ ☆ ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
The increasing scale and complexity of large language models (LLMs) pose
significant inference latency challenges, primarily due to their autoregressive
decoding paradigm characterized by the sequential nature of next-token
prediction. By re-examining the outputs of autoregressive models, we observed
that some segments exhibit parallelizable structures, which we term intrinsic
parallelism. Decoding each parallelizable branch simultaneously (i.e. parallel
decoding) can significantly improve the overall inference speed of LLMs. In
this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which
addresses two core challenges: the automated construction of parallelizable
data and an efficient parallel decoding mechanism. More specifically, we introduce a
non-invasive pipeline that automatically extracts and validates parallelizable
structures from the responses of autoregressive models. To empower efficient
adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which
enables seamless transitions between serial and parallel decoding modes while
maintaining a reusable KV cache, maximizing computational efficiency. Extensive
evaluations across general tasks, retrieval-augmented generation, and
mathematical reasoning demonstrate that ASPD achieves strong performance in both
effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up
to 3.19x speedup (1.85x on average) while maintaining response quality within
1% difference compared to autoregressive models, realizing significant
acceleration without compromising generation quality. Our framework sets a
groundbreaking benchmark for efficient LLM parallel inference, paving the way
for its deployment in latency-sensitive applications such as AI-powered
customer service bots and answer retrieval engines.
comment: 20 pages, 9 figures
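The serial/parallel mode switching can be caricatured as follows (segments marked parallelizable are processed concurrently; the `str.upper` "decoder" merely stands in for LLM decoding against a shared KV cache):

```python
from concurrent.futures import ThreadPoolExecutor

def decode(segments, branch_decoder):
    """Serial segments run in order; parallel branches run concurrently."""
    out = []
    with ThreadPoolExecutor() as pool:
        for kind, payload in segments:
            if kind == "parallel":            # mutually independent branches
                out.extend(pool.map(branch_decoder, payload))
            else:                             # depends on all prior context
                out.append(branch_decoder(payload))
    return out

segs = [("serial", "intro:"),
        ("parallel", ["item a", "item b", "item c"]),
        ("serial", "summary")]
print(decode(segs, str.upper))  # ['INTRO:', 'ITEM A', 'ITEM B', 'ITEM C', 'SUMMARY']
```

The hard parts that ASPD actually solves, detecting which spans are independent and keeping the KV cache reusable across mode switches, are abstracted away here.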
♻ ☆ Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models
Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki
Instruction tuning is crucial for enabling Large Language Models (LLMs) to
solve real-world tasks. Prior work has shown the effectiveness of
instruction-tuning data synthesized solely from LLMs, raising a fundamental
question: Do we still need human-originated signals for instruction tuning?
This work answers the question affirmatively: we build state-of-the-art
instruction-tuning datasets sourced from human-written instructions, by simply
pairing them with LLM-generated responses. LLMs fine-tuned on our datasets
consistently outperform those fine-tuned on existing ones. Our data
construction approach can be easily adapted to other languages; we build
datasets for Japanese and confirm that LLMs tuned with our data reach
state-of-the-art performance. Analyses suggest that instruction-tuning in a new
language allows LLMs to follow instructions, while the tuned models exhibit a
notable lack of culture-specific knowledge in that language. The datasets and
fine-tuned models will be publicly available. Our datasets, synthesized with
open-weight LLMs, are openly distributed under permissive licenses, allowing
for diverse use cases.
comment: COLM 2025; Datasets are available at
https://huggingface.co/datasets/tokyotech-llm/lmsys-chat-1m-synth
♻ ☆ TikZero: Zero-Shot Text-Guided Graphics Program Synthesis ICCV 2025
Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
Automatically synthesizing figures from text captions is a compelling
capability. However, achieving high geometric precision and editability
requires representing figures as graphics programs in languages like TikZ, and
aligned training data (i.e., graphics programs with captions) remains scarce.
Meanwhile, large amounts of unaligned graphics programs and captioned raster
images are more readily available. We reconcile these disparate data sources by
presenting TikZero, which decouples graphics program generation from text
understanding by using image representations as an intermediary bridge. It
enables independent training on graphics programs and captioned images and
allows for zero-shot text-guided graphics program synthesis during inference.
We show that our method substantially outperforms baselines that can only
operate with caption-aligned graphics programs. Furthermore, when leveraging
caption-aligned graphics programs as a complementary training signal, TikZero
matches or exceeds the performance of much larger models, including commercial
systems like GPT-4o. Our code, datasets, and select models are publicly
available.
comment: Accepted at ICCV 2025 (highlight); Project page:
https://github.com/potamides/DeTikZify
♻ ☆ Meanings are like Onions: a Layered Approach to Metaphor Processing
Metaphorical meaning is not a flat mapping between concepts, but a complex
cognitive phenomenon that integrates multiple levels of interpretation. In this
paper, we propose a stratified model of metaphor processing that treats meaning
as an onion: a multi-layered structure comprising (1) content analysis, (2)
conceptual blending, and (3) pragmatic intentionality. This three-dimensional
framework allows for a richer and more cognitively grounded approach to
metaphor interpretation in computational systems. At the first level, metaphors
are annotated through basic conceptual elements. At the second level, we model
conceptual combinations, linking components to emergent meanings. Finally, at
the third level, we introduce a pragmatic vocabulary to capture speaker intent,
communicative function, and contextual effects, aligning metaphor understanding
with pragmatic theories. By unifying these layers into a single formal
framework, our model lays the groundwork for computational methods capable of
representing metaphorical meaning beyond surface associations, toward deeper,
more context-sensitive reasoning.
♻ ☆ DeepWriter: A Fact-Grounded Multimodal Writing Assistant Based On Offline Knowledge Base
Large Language Models (LLMs) have demonstrated remarkable capabilities in
various applications. However, their use as writing assistants in specialized
domains like finance, medicine, and law is often hampered by a lack of deep
domain-specific knowledge and a tendency to hallucinate. Existing solutions,
such as Retrieval-Augmented Generation (RAG), can suffer from inconsistency
across multiple retrieval steps, while online search-based methods often
degrade quality due to unreliable web content. To address these challenges, we
introduce DeepWriter, a customizable, multimodal, long-form writing assistant
that operates on a curated, offline knowledge base. DeepWriter leverages a
novel pipeline that involves task decomposition, outline generation, multimodal
retrieval, and section-by-section composition with reflection. By deeply mining
information from a structured corpus and incorporating both textual and visual
elements, DeepWriter generates coherent, factually grounded, and
professional-grade documents. We also propose a hierarchical knowledge
representation to enhance retrieval efficiency and accuracy. Our experiments on
financial report generation demonstrate that DeepWriter produces high-quality,
verifiable articles that surpass existing baselines in factual accuracy and
generated content quality.
comment: work in progress
♻ ☆ LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang
Large reasoning models have achieved remarkable performance through extended
chain-of-thought sequences, yet this computational freedom leads to excessive
token generation even for simple problems. We present Length-Adaptive Policy
Optimization (LAPO), a novel framework that transforms reasoning length control
from an external constraint into an intrinsic model capability. Unlike existing
approaches that impose rigid limits or rely on post-hoc interventions, LAPO
enables models to internalize an understanding of appropriate reasoning depth
through a two-stage reinforcement learning process. In the first stage, models
learn natural reasoning patterns by discovering the statistical distribution of
successful solution lengths. The second stage leverages these patterns as
meta-cognitive guidance, embedding them directly within the model's reasoning
context to ensure inference-time flexibility. Experiments on mathematical
reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9%
while improving accuracy by 2.3%. Our analysis reveals that models trained with
LAPO develop emergent abilities to allocate computational resources based on
problem complexity, achieving efficient reasoning without sacrificing quality.
comment: GitHub: https://github.com/zju-real/lapo Project: https://zju-real.github.io/lapo
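The first stage's length statistics might look like this sketch (our construction: take a percentile of the lengths of successful rollouts and embed it as a prompt hint; LAPO's actual procedure is a two-stage RL pipeline):

```python
def length_budget(rollouts, pct=0.5):
    """Percentile of token lengths among successful rollouts."""
    ok = sorted(length for length, correct in rollouts if correct)
    if not ok:
        return None
    return ok[int(pct * (len(ok) - 1))]

# (length, correct) pairs from stage-1 sampling; the numbers are made up.
rollouts = [(120, True), (480, True), (90, True), (700, False), (200, True)]
budget = length_budget(rollouts)
hint = f"Aim for about {budget} reasoning tokens."  # embedded in stage-2 prompts
print(hint)
```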
♻ ☆ Measuring Diversity in Synthetic Datasets ICML 2025
Large language models (LLMs) are widely adopted to generate synthetic
datasets for various natural language processing (NLP) tasks, such as text
classification and summarization. However, accurately measuring the diversity
of these synthetic datasets, an aspect crucial for robust model performance,
remains a significant challenge. In this paper, we introduce
DCScore, a novel method for measuring synthetic dataset diversity from a
classification perspective. Specifically, DCScore formulates diversity
evaluation as a sample classification task, leveraging mutual relationships
among samples. We further provide theoretical verification of the
diversity-related axioms satisfied by DCScore, highlighting its role as a
principled diversity evaluation method. Experimental results on synthetic
datasets reveal that DCScore enjoys a stronger correlation with multiple
diversity pseudo-truths of evaluated datasets, underscoring its effectiveness.
Moreover, both empirical and theoretical evidence demonstrate that DCScore
substantially reduces computational costs compared to existing methods. Code is
available at: https://github.com/bluewhalelab/dcscore.
comment: Accepted by ICML 2025
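A loose toy analogue of diversity-as-classification: treat each sample as its own class, apply a softmax over pairwise similarities, and average the probability that each sample is classified as itself. Duplicates then share probability mass and lower the score. This only approximates DCScore's actual formulation:

```python
import math

def diversity(vectors):
    """Mean probability that each sample is 'classified' as itself."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    total = 0.0
    for i, v in enumerate(vectors):
        exps = [math.exp(dot(v, u)) for u in vectors]
        total += exps[i] / sum(exps)  # softmax over similarities to all samples
    return total / len(vectors)

distinct = [[3.0, 0.0], [0.0, 3.0], [-3.0, 0.0]]
dupes = [[3.0, 0.0], [3.0, 0.0], [3.0, 0.0]]
print(diversity(distinct) > diversity(dupes))  # True: duplicates score lower
```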
♻ ★ LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint ACL2025
Fine-tuning pre-trained Large Language Models (LLMs) for specialized tasks
incurs substantial computational and data costs. While model merging offers a
training-free solution to integrate multiple task-specific models, existing
methods suffer from safety-utility conflicts where enhanced general
capabilities degrade safety safeguards. We identify two root causes:
$\textbf{neuron misidentification}$ due to simplistic parameter magnitude-based
selection, and $\textbf{cross-task neuron interference}$ during merging. To
address these challenges, we propose $\textbf{LED-Merging}$, a three-stage
framework that $\textbf{L}$ocates task-specific neurons via gradient-based
attribution, dynamically $\textbf{E}$lects critical neurons through multi-model
importance fusion, and $\textbf{D}$isjoints conflicting updates through
parameter isolation. Extensive experiments on Llama-3-8B, Mistral-7B, and
Llama2-13B demonstrate that LED-Merging effectively reduces harmful response
rates, showing a 31.4\% decrease on Llama-3-8B-Instruct on HarmBench, while
simultaneously preserving 95\% of utility performance, such as achieving
52.39\% accuracy on GSM8K. LED-Merging resolves safety-utility conflicts and
provides a lightweight, training-free paradigm for constructing reliable
multi-task LLMs. Code is available at
$\href{https://github.com/MqLeet/LED-Merging}{GitHub}$.
comment: Accepted by ACL2025 main conference
♻ ☆ Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free
Large Language Models (LLMs) face deployment challenges due to high
computational costs, and while Post-Training Quantization (PTQ) offers a
solution, existing rotation-based methods struggle at very low bit-widths like
2-bit. We introduce a novel, training-free approach to construct an improved
rotation matrix, addressing the limitations of current methods. The key
contributions include leveraging the Walsh-Hadamard transform with sequency
ordering, which clusters similar frequency components to reduce quantization
error compared to standard Hadamard matrices, significantly improving
performance. Furthermore, we propose a Grouped Sequency-arranged Rotation (GSR)
using block-diagonal matrices with smaller Walsh blocks, effectively isolating
outlier impacts and achieving performance comparable to optimization-based
methods without requiring any training. Our method demonstrates robust
performance on reasoning tasks and Perplexity (PPL) score on WikiText-2. Our
method also enhances results even when applied over existing learned rotation
techniques.
comment: 7 pages
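The sequency-ordering step can be reproduced directly: build the natural (Sylvester) Hadamard matrix and sort its rows by the number of sign changes. GSR's block-diagonal grouping of smaller Walsh blocks is omitted here:

```python
def hadamard(n):
    """Natural-order (Sylvester) Hadamard matrix via doubling."""
    H = [[1]]
    while len(H) < n:  # H -> [[H, H], [H, -H]]
        H = [r + r for r in H] + [r + [-x for x in r] for r in H]
    return H

def sequency(row):
    """Number of sign changes along the row."""
    return sum(a != b for a, b in zip(row, row[1:]))

def walsh(n):
    """Rows reordered by increasing sequency."""
    return sorted(hadamard(n), key=sequency)

for row in walsh(4):
    print(row, sequency(row))
# sequencies come out as 0, 1, 2, 3 instead of the scrambled natural order
```

Sorting by sequency clusters rows of similar "frequency" next to each other, which is the property the paper exploits to reduce quantization error.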
♻ ☆ Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference
Artificial Intelligence (AI) conferences are essential for advancing
research, sharing knowledge, and fostering academic community. However, their
rapid expansion has rendered the centralized conference model increasingly
unsustainable. This paper offers a data-driven diagnosis of a structural crisis
that threatens the foundational goals of scientific dissemination, equity, and
community well-being. We identify four key areas of strain: (1) scientifically,
with per-author publication rates more than doubling over the past decade to
over 4.5 papers annually; (2) environmentally, with the carbon footprint of a
single conference exceeding the daily emissions of its host city; (3)
psychologically, with 71% of online community discourse reflecting negative
sentiment and 35% referencing mental health concerns; and (4) logistically,
with attendance at top conferences such as NeurIPS 2024 beginning to outpace
venue capacity. These pressures point to a system that is misaligned with its
core mission. In response, we propose the Community-Federated Conference (CFC)
model, which separates peer review, presentation, and networking into globally
coordinated but locally organized components, offering a more sustainable,
inclusive, and resilient path forward for AI research.
comment: Preprint
♻ ☆ Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback CIKM '25
Accurate personalized headline generation hinges on precisely capturing user
interests from historical behaviors. However, existing methods neglect
personalization-irrelevant click noise within entire historical clickstreams,
which may lead to hallucinated headlines that deviate from genuine user preferences.
In this paper, we reveal the detrimental impact of click noise on personalized
generation quality through rigorous analysis in both user and news dimensions.
Based on these insights, we propose a novel Personalized Headline Generation
framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF).
PHG-DIF first employs dual-stage filtering to effectively remove clickstream
noise, identified by short dwell times and abnormal click bursts, and then
leverages multi-level temporal fusion to dynamically model users' evolving and
multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a
new benchmark dataset comprising the click behavior of 1,000 carefully curated
users and nearly 10,000 annotated personalized headlines with historical dwell
time annotations. Extensive experiments demonstrate that PHG-DIF substantially
mitigates the adverse effects of click noise and significantly improves
headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our
framework implementation and dataset are available at
https://github.com/liukejin-up/PHG-DIF.
comment: Accepted by the 34th ACM International Conference on Information and
Knowledge Management (CIKM '25), Full Research Papers track
♻ ☆ A New Query Expansion Approach via Agent-Mediated Dialogic Inquiry KDD 2025
Query expansion is widely used in Information Retrieval (IR) to improve
search outcomes by supplementing initial queries with richer information. While
recent Large Language Model (LLM) based methods generate pseudo-relevant
content and expanded terms via multiple prompts, they often yield homogeneous,
narrow expansions that lack the diverse context needed to retrieve relevant
information. In this paper, we propose AMD: a new Agent-Mediated Dialogic
Framework that engages in a dialogic inquiry involving three specialized roles:
(1) a Socratic Questioning Agent reformulates the initial query into three
sub-questions, with each question inspired by a specific Socratic questioning
dimension, including clarification, assumption probing, and implication
probing, (2) a Dialogic Answering Agent generates pseudo-answers, enriching the
query representation with multiple perspectives aligned to the user's intent,
and (3) a Reflective Feedback Agent evaluates and refines these pseudo-answers,
ensuring that only the most relevant and informative content is retained. By
leveraging a multi-agent process, AMD effectively crafts richer query
representations through inquiry and feedback refinement. Extensive experiments
on benchmarks including BEIR and TREC demonstrate that our framework
outperforms previous methods, offering a robust solution for retrieval tasks.
comment: Accepted by ACM SIGKDD 2025 Workshop on AI Agent for Information
Retrieval (Agent4IR)
♻ ☆ Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods
In this work, we demonstrate that certain machine unlearning methods may fail
under straightforward prompt attacks. We systematically evaluate eight
unlearning techniques across three model families using output-based,
logit-based, and probe analysis to assess the extent to which supposedly
unlearned knowledge can be retrieved. While methods like RMU and TAR exhibit
robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g.,
prepending Hindi filler text to the original prompt recovers 57.3% accuracy).
Our logit analysis further indicates that unlearned models are unlikely to hide
knowledge through changes in answer formatting, given the strong correlation
between output and logit accuracy. These findings challenge prevailing
assumptions about unlearning effectiveness and highlight the need for
evaluation frameworks that can reliably distinguish between genuine knowledge
removal and superficial output suppression. To facilitate further research, we
publicly release our evaluation framework to easily evaluate prompting
techniques to retrieve unlearned knowledge.
comment: 19 pages, 6 figures. Accepted at COLM 2025 SoLaR Workshop
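The evaluation recipe amounts to comparing accuracy on bare prompts versus perturbed prompts. The stub model below mimics superficial suppression, refusing bare questions but answering once filler text is prepended; all names here are hypothetical:

```python
def accuracy(model, questions, prefix=""):
    correct = sum(model(prefix + q) == a for q, a in questions)
    return correct / len(questions)

answers = {"Q1": "A1", "Q2": "A2"}

def stub(prompt):
    q = prompt.split()[-1]             # last token is the question id
    if prompt.startswith("FILLER "):   # perturbation bypasses the refusal
        return answers.get(q)
    return "REFUSED"                   # superficial suppression on bare prompts

qs = [("Q1", "A1"), ("Q2", "A2")]
print(accuracy(stub, qs), accuracy(stub, qs, prefix="FILLER "))  # 0.0 1.0
```

A large gap between the two numbers is exactly the signature of knowledge that was suppressed rather than removed.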
♻ ☆ Echoes of Automation: The Increasing Use of LLMs in Newsmaking
The rapid rise of Generative AI (GenAI), particularly LLMs, poses concerns
for journalistic integrity and authorship. This study examines AI-generated
content across over 40,000 news articles from major, local, and college news
media, in various media formats. Using three advanced AI-text detectors
(Binoculars, Fast-DetectGPT, and GPTZero), we find a substantial increase in
GenAI use in recent years, especially in local and college news. Sentence-level
analysis reveals that LLMs are often used in news introductions, while
conclusions are usually written manually. Linguistic analysis shows GenAI boosts
word richness and readability but lowers formality, leading to more uniform
writing styles, particularly in local media.
comment: To appear in SBP-BRiMS 2025
♻ ☆ ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning
Xingshan Zeng, Weiwen Liu, Xu Huang, Zezhong Wang, Lingzhi Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Ruiming Tang, Qun Liu
Tool learning, which allows Large Language Models (LLMs) to leverage external
tools for solving complex user tasks, has emerged as a promising avenue for
extending model capabilities. However, existing approaches primarily focus on
data synthesis for fine-tuning LLMs to invoke tools effectively, largely
ignoring how to fully stimulate the potential of the model. In this paper, we
propose ToolACE-R, a novel framework that includes both model-aware iterative
training and adaptive refinement for tool learning. ToolACE-R features a
model-aware iterative training procedure that progressively adjusts training
samples based on the model's evolving capabilities to maximize its potential.
Additionally, it incorporates a self-refinement training corpus that emphasizes
the LLM's ability to iteratively refine its tool calls, optimizing performance
without requiring external feedback. Furthermore, we introduce an adaptive
self-refinement mechanism for efficient test-time scaling, where the trained
model can autonomously determine when to stop the process based on iterative
self-refinement. We conduct extensive experiments across several benchmark
datasets, showing that ToolACE-R achieves competitive performance compared to
advanced API-based models. The performance of tool invocation can be further
improved efficiently through adaptive self-refinement. These results highlight
the effectiveness and generalizability of ToolACE-R, offering a promising
direction for more efficient and scalable tool learning.
♻ ☆ A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models
Sadia Kamal, Lalu Prasad Yadav Prakash, S M Rafiuddin, Mohammed Rakib, Atriya Sen, Sagnik Ray Choudhury
Political Compass Test (PCT) or similar questionnaires have been used to
quantify LLM's political leanings. Building on a recent line of work that
examines the validity of PCT tests, we demonstrate that variation in standard
generation parameters does not significantly impact the models' PCT scores.
However, external factors such as prompt variations and fine-tuning, both
individually and in combination, do affect the scores. Finally, we demonstrate that
when models are fine-tuned on text datasets with higher political content than
others, the PCT scores are not differentially affected. This calls for a
thorough investigation into the validity of PCT and similar tests, as well as
the mechanism by which political leanings are encoded in LLMs.
♻ ☆ PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
We introduce PRELUDE, a benchmark for evaluating long-context understanding
through the task of determining whether a character's prequel story is
consistent with the canonical narrative of the original book. Our task poses a
stronger demand for global comprehension and deep reasoning than existing
benchmarks -- as the prequels are not part of the original story, assessing
their plausibility typically requires searching and integrating information
that is only indirectly related. Empirically, 88% of instances require evidence
from multiple parts of the narrative. Experimental results highlight the
challenge of our task: in-context learning, RAG, and in-domain training with
state-of-the-art LLMs, as well as commercial DeepResearch services, all lag
behind humans by >15%. A further human study reveals that models often produce correct
answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy
compared to humans. These findings underscore the substantial room for
improvement in long-context understanding and reasoning.
comment: First 7 authors contributed equally. Project page:
https://gorov.github.io/prelude
♻ ☆ Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning
Large language models (LLMs) have transformed sentiment analysis, yet
balancing accuracy, efficiency, and explainability remains a critical
challenge. This study presents the first comprehensive evaluation of
DeepSeek-R1, an open-source reasoning model, against OpenAI's GPT-4o and
GPT-4o-mini. We test the full 671B model and its distilled variants,
systematically documenting few-shot learning curves. Our experiments show
DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31%
accuracy on binary tasks with just 5 shots, an eightfold improvement in
few-shot efficiency over GPT-4o. Architecture-specific distillation effects
emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant
by 6.69 percentage points. While its reasoning process reduces throughput,
DeepSeek-R1 offers superior explainability via transparent, step-by-step
traces, establishing it as a powerful, interpretable open-source alternative.
comment: 10 pages, 2 figures, 6 tables, revised and re-submitted to an IEEE
journal
♻ ☆ Marco-Voice Technical Report
Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
This paper presents a multifunctional speech synthesis system that integrates
voice cloning and emotion control speech synthesis within a unified framework.
The goal of this work is to address longstanding challenges in achieving highly
expressive, controllable, and natural speech generation that faithfully
preserves speaker identity across diverse linguistic and emotional contexts.
Our approach introduces an effective speaker-emotion disentanglement mechanism
with in-batch contrastive learning, enabling independent manipulation of
speaker identity and emotional style, as well as a rotational emotional
embedding integration method for smooth emotion control. To support
comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality
emotional speech dataset containing 10 hours of Mandarin speech from six
professional speakers across seven emotional categories. Extensive experiments
demonstrate that our system, Marco-Voice, achieves substantial improvements in
both objective and subjective metrics. Comprehensive evaluations and analyses
were conducted; the results show that Marco-Voice delivers competitive performance
in terms of speech clarity and emotional richness, representing a substantial
advance in the field of expressive neural speech synthesis. Our code and
dataset are publicly available at https://github.com/AIDC-AI/Marco-Voice and
https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively.
comment: Technical Report. Our code and dataset are publicly available at
https://github.com/AIDC-AI/Marco-Voice and
https://huggingface.co/datasets/AIDC-AI/CSEMOTIONS respectively
♻ ☆ MedRep: Medical Concept Representation for General Electronic Health Record Foundation Models
Electronic health record (EHR) foundation models have been an area ripe for
exploration, given their improved performance on various medical tasks. Despite
the rapid advances, a fundamental limitation remains: processing unseen,
out-of-vocabulary medical codes. This problem limits the generalizability of
EHR foundation models and the integration of models trained with different
vocabularies. To alleviate this problem, we propose a set of novel medical
concept representations (MedRep) for EHR foundation models based on the
observational medical outcome partnership (OMOP) common data model (CDM). For
concept representation learning, we enrich the information of each concept with
a minimal definition through large language model (LLM) prompts and complement
the text-based representations through the graph ontology of OMOP vocabulary.
Our approach outperforms the vanilla EHR foundation model and the model with a
previously introduced medical code tokenizer in diverse prediction tasks. We
also demonstrate the generalizability of MedRep through external validation.
comment: 18 pages
♻ ☆ Performance of GPT-5 Frontier Models in Ophthalmology Question Answering
Fares Antaki, David Mikhail, Daniel Milad, Danny A Mammo, Sumit Sharma, Sunil K Srivastava, Bing Yu Chen, Samir Touma, Mertcan Sevgi, Jonathan El-Khoury, Pearse A Keane, Qingyu Chen, Yih Chung Tham, Renaud Duval
Large language models (LLMs) such as GPT-5 integrate advanced reasoning
capabilities that may improve performance on complex medical question-answering
tasks. For this latest generation of reasoning models, the configurations that
maximize both accuracy and cost-efficiency have yet to be established. We
evaluated 12 configurations of OpenAI's GPT-5 series (three model tiers across
four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using
260 closed-access multiple-choice questions from the American Academy of
Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome
was multiple-choice accuracy; secondary outcomes included head-to-head ranking
via a Bradley-Terry model, rationale quality assessment using a
reference-anchored, pairwise LLM-as-a-judge framework, and analysis of
accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved
the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano
variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high
(0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x
stronger than o3-high) and rationale quality (1.11x stronger than o3-high).
Cost-accuracy analysis identified several GPT-5 configurations on the Pareto
frontier, with GPT-5-mini-low offering the most favorable low-cost,
high-performance balance. These results benchmark GPT-5 on a high-quality
ophthalmology dataset, demonstrate the influence of reasoning effort on
accuracy, and introduce an autograder framework for scalable evaluation of
LLM-generated answers against reference standards in ophthalmology.
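The head-to-head ranking mentioned above uses a Bradley-Terry model, which estimates a latent "strength" for each model from pairwise win counts; ratios of strengths yield figures like the "1.66x stronger" comparison. A minimal sketch of the standard MM fitting procedure (illustrative only; the paper's evaluation code is not shown, and the win matrix below is made up):

```python
# Minimal Bradley-Terry fit via the classic MM update:
#   s_i <- W_i / sum_{j != i} n_ij / (s_i + s_j)
# where W_i is model i's total wins and n_ij the number of i-vs-j comparisons.
def bradley_terry(wins, iters=200):
    n = len(wins)
    s = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else s[i])
        total = sum(new)
        s = [v * n / total for v in new]             # normalize (identifiability)
    return s

# Toy win matrix for three hypothetical models: wins[i][j] = times i beat j.
wins = [[0, 7, 9],
        [3, 0, 6],
        [1, 4, 0]]
strengths = bradley_terry(wins)
# strengths[0] / strengths[1] is the "x-times stronger" ratio reported above.
```

Because only strength ratios are identified, the normalization step is arbitrary; any fixed scaling gives the same pairwise ratios.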
♻ ☆ Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models
Expanding the abbreviated column names of tables, such as "esal" to "employee
salary", is critical for numerous downstream data tasks. This problem arises in
enterprises, domain sciences, government agencies, and more. In this paper we
make three contributions that significantly advance the state of the art.
First, we show that the synthetic public data used by prior work has major
limitations, and we introduce 4 new datasets in enterprise/science domains,
with real-world abbreviations. Second, we show that the accuracy measures used by
prior work seriously undercount correct expansions, and we propose new
synonym-aware measures that capture correctness much more faithfully. Finally, we
develop Columbo, a powerful LLM-based solution that exploits context, rules,
chain-of-thought reasoning, and token-level analysis. Extensive experiments
show that Columbo significantly outperforms NameGuess, the current most
advanced solution, by 4-29% across 5 datasets. Columbo has been used in
production on EDI, a major data portal for environmental sciences.
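A synonym-aware measure of the kind described above can be illustrated with a small sketch (hedged: the paper's exact metric is not given here, and the normalization and synonym sets below are hypothetical). The idea is that a predicted expansion counts as correct if it matches any accepted synonym of the gold answer, rather than only the single gold string.

```python
# Hedged sketch of a synonym-aware accuracy measure: a prediction is correct
# if, after light normalization, it matches any accepted synonym of the gold
# expansion. Exact-match accuracy would undercount such cases.
def normalize(s: str) -> str:
    return " ".join(s.lower().replace("_", " ").split())

def synonym_aware_accuracy(preds, gold_synonyms):
    hits = sum(1 for p, syns in zip(preds, gold_synonyms)
               if normalize(p) in {normalize(g) for g in syns})
    return hits / len(preds)

# Exact match would score 0.5 here; the synonym-aware measure credits both,
# since "staff salary" is an accepted synonym of the second gold answer.
preds = ["Employee Salary", "staff salary"]
gold = [["employee salary"], ["employee salary", "staff salary"]]
acc = synonym_aware_accuracy(preds, gold)   # 1.0
```

This is exactly the kind of case where, as the abstract argues, exact-match measures "seriously undercount correct expansions".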
♻ ☆ Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen
Large Language Models (LLMs) hold promise in automating data analysis tasks,
yet open-source models face significant limitations in these kinds of
reasoning-intensive scenarios. In this work, we investigate strategies to
enhance the data analysis capabilities of open-source LLMs. By curating a seed
dataset of diverse, realistic scenarios, we evaluate model behavior across
three core dimensions: data understanding, code generation, and strategic
planning. Our analysis reveals three key findings: (1) Strategic planning
quality serves as the primary determinant of model performance; (2) Interaction
design and task complexity significantly influence reasoning capabilities; (3)
Data quality demonstrates a greater impact than diversity in achieving optimal
performance. We leverage these insights to develop a data synthesis
methodology, demonstrating significant improvements in open-source LLMs'
analytical reasoning capabilities. Code is available at
https://github.com/zjunlp/DataMind.
comment: Work in progress